Many engineering systems can be characterized as a large scale collection of
interacting subsystems, each having access to local information, making local decisions,
having local interactions with neighbors, and seeking to optimize local objectives that
may well be in conflict with other subsystems. The analysis and design of such
control systems falls under the broader framework of “complex and distributed systems”.
Other names include “multi-agent control,” “cooperative control,” “networked
control,” as well as “team theory” or “swarming.” Regardless of the nomenclature, the
central challenge remains the same: to derive desirable collective behaviors
through the design of individual agent control algorithms. The potential benefits of
distributed decision architectures include the opportunity for real-time adaptation (or
self-organization) and robustness to dynamic uncertainties such as individual
component failures, non-stationary environments, and adversarial elements. These benefits
come with significant challenges, such as the complexity associated with a potentially
large number of interacting agents and the analytical difficulties of dealing with
overlapping and partial information.

This dissertation focuses on dealing with the distributed nature of decision
making and information processing through a non-cooperative game-theoretic formulation.
The interactions of a distributed/multi-agent control system are modeled as a
non-cooperative game among agents, with the desired collective behavior being expressed
as a Nash equilibrium. In large scale multi-agent systems, agents are inherently
limited in both their observational and computational capabilities. Therefore, this
dissertation focuses on learning algorithms that can accommodate these limitations while
still guaranteeing convergence to a Nash equilibrium. Furthermore, in this dissertation
we illustrate a connection between the fields of game theory and cooperative control
and develop several suitable learning algorithms for a wide variety of cooperative
control problems. This connection establishes a framework for designing and analyzing
multi-agent systems. We demonstrate the potential benefits of this framework on
several cooperative control problems including dynamic sensor coverage, consensus, and
distributed routing over a network, as well as the mathematical puzzle Sudoku.

CHAPTER 1


Overview 

Many engineering systems can be characterized as a large scale collection of
interacting subsystems, each having access to local information, making local decisions,
having local interactions with neighbors, and seeking to optimize local objectives that
may well be in conflict with other subsystems. A representative sampling includes
autonomous vehicle teams, cooperative robotics, distributed computing, electronic
commerce, wireless networks, sensor networks, traffic control, social networks, and
combat systems.

The analysis and design of such control systems falls under the broader framework
of “complex and distributed systems”. Other names include “multi-agent control,”
“cooperative control,” “networked control,” as well as “team theory” or “swarming.”
Regardless of the nomenclature, the central challenge remains the same: to
derive desirable collective behaviors through the design of individual agent control
algorithms. The potential benefits of distributed decision architectures include the
opportunity for real-time adaptation (or self-organization) and robustness to dynamic
uncertainties such as individual component failures, non-stationary environments, and
adversarial elements. These benefits come with significant challenges, such as the
complexity associated with a potentially large number of interacting agents and the
analytical difficulties of dealing with overlapping and partial information.

This dissertation focuses on dealing with the distributed nature of decision
making and information processing through a non-cooperative game-theoretic formulation.
The interactions of a distributed/multi-agent control system are modeled as a
non-cooperative game among agents, with the desired collective behavior being expressed
as a Nash equilibrium. The emphasis is on simple learning algorithms that guarantee
convergence to a Nash equilibrium. Furthermore, the algorithms must have minimal
computational requirements to accommodate implementation in a wide variety of
engineered systems.

The need for simple learning algorithms can be motivated by looking at the
problem of distributed routing over a network. In such a problem, there is a large number
of self-interested players seeking to utilize a common network. Since the available
resources in the network are finite, players’ objectives are very much in conflict with one
another. The sheer quantity of available information makes centralized dissemination
or processing infeasible. When modeling the players’ interaction as a non-cooperative
game, the central issue involves how players make decisions. More precisely, what
information do players need to base their decisions on so as to guarantee some form of
collective behavior? For example, does each player need to know the routing
strategies of all other players, or would some form of aggregate information be acceptable?

Motivated by the inherent information restrictions in the problem of distributed
routing over networks, in Chapter 3 we consider multi-player repeated games
involving a large number of players with large strategy spaces and enmeshed utility
structures. In these “large-scale” games, players are inherently faced with limitations in
both their observational and computational capabilities. Accordingly, players in
large-scale games need to make their decisions using algorithms that accommodate
limitations in information gathering and processing. This disqualifies some of the well
known decision making models such as “Fictitious Play” (FP) [MS96a], in which each
player must monitor the individual actions of every other player and must optimize
over a high dimensional probability space.

In this chapter, we analyze the properties of the learning algorithm Joint Strategy
Fictitious Play (JSFP), a close variant of FP. We demonstrate that JSFP alleviates both
the informational and computational burden of FP. Furthermore, we introduce JSFP
with inertia, i.e., a probabilistic reluctance to change strategies, and establish
convergence to a pure Nash equilibrium in all generalized ordinal potential games in both
cases of averaged or exponentially discounted historical data. We illustrate JSFP with
inertia on the specific class of congestion games, a subset of generalized ordinal
potential games. In particular, we illustrate the main results on a distributed traffic routing
problem.

In Chapter 4, we extend the results of JSFP by introducing an entire class of
learning algorithms that can accommodate such observational and processing restrictions.
To that end, we build upon the idea of no-regret algorithms [HM00] to strengthen the
performance guarantees for implementation in multi-agent systems. No-regret
algorithms have been proposed to control a wide variety of multi-agent systems. The appeal
of no-regret algorithms is that they are easily implementable in large scale multi-agent
systems because players make decisions using only regret based information.
Furthermore, there are existing results proving that the collective behavior will asymptotically
converge to a set of points of “no-regret” in any game. We illustrate, through a
simple example, that no-regret points need not reflect desirable operating conditions for a
multi-agent system.

Multi-agent systems often exhibit an additional structure, i.e., being weakly acyclic,
that has not been exploited in the context of no-regret algorithms. In this chapter, we
introduce a modification of the traditional no-regret algorithms by (i) exponentially
discounting the memory and (ii) bringing in a notion of inertia in players’ decision
process. We show how these modifications can lead to an entire class of regret based
algorithms that provide almost sure convergence to a pure Nash equilibrium in any
weakly acyclic game.

The last, and most informationally restrictive, class of learning algorithms that
we will consider in this dissertation is payoff based algorithms. In such a scenario,
players only have access to (i) the action they played and (ii) the utility (possibly
noisy) they received. In a transportation network, this translates to drivers only having
information about the congestion actually experienced. Drivers are now unaware of
the traffic conditions on any alternative routes, which was previously a requirement
for the implementation of either JSFP or any regret based learning algorithm.

In Chapter 5, we focus on payoff based learning algorithms on the specific class of
weakly acyclic games. We introduce three different payoff based processes for
increasingly general scenarios and prove that after a sufficiently large number of stages, player
actions constitute a Nash equilibrium at any stage with arbitrarily high probability. The
first learning algorithm, called Safe Experimentation, guarantees convergence to an
optimal Nash equilibrium in any identical interest game. Such an equilibrium is called
optimal because it maximizes the payoff to all players. The second learning algorithm,
called Simple Experimentation, guarantees convergence to a Nash equilibrium in any
weakly acyclic game. The third learning algorithm, called Sample Experimentation,
guarantees convergence to a Nash equilibrium in any weakly acyclic game even in the
presence of noisy utility functions.

The second topic of Chapter 5 is centered around the inefficiency of Nash
equilibria in routing problems. It is well known that a Nash equilibrium may not represent
a desirable operating point in a routing problem as it typically does not minimize the
total congestion on the network. Motivated by this inefficiency concern, we derive an
approach for modifying player utility functions through tolls and incentives in
congestion games, a special class of weakly acyclic games, to guarantee that a centralized
objective can be realized as a Nash equilibrium. We illustrate this equilibrium
refinement method on a well studied distributed routing problem known as Braess’ Paradox.

In the following chapter, the focus shifts from the development of suitable learning
algorithms to understanding how one would design a multi-agent system for a
cooperative control problem. In particular, how would a global planner design each agent’s
local utility function such that a central objective could be realized as the outcome
of a repeated non-cooperative game? We seek to answer this question by
highlighting a connection between cooperative control problems and potential games. This
connection to potential games provides a structural framework with which to study
cooperative control problems and suggests an approach for utility design. However,
we note that utility design for multi-agent systems is still very much an
open issue.

In Chapter 6, we present a view of cooperative control using the language of
learning in games. We review the game theoretic concepts of potential games and weakly
acyclic games and demonstrate how several cooperative control problems such as
consensus, dynamic sensor coverage, and even the mathematical puzzle Sudoku can be
formulated in these settings. Motivated by this connection, we build upon game
theoretic concepts to better accommodate a broader class of cooperative control problems.
In particular, we introduce two extensions of the learning algorithm Spatial Adaptive
Play. The first extension, called binary Restricted Spatial Adaptive Play, accommodates
restricted action sets caused by limitations in agent capabilities. The second
extension, called Spatial Adaptive Play with Group Based Decisions, accommodates group
based collaborations in the decision making process. Furthermore, we also introduce
a new class of games, called sometimes weakly acyclic games, for time-varying
utility functions and action sets, and provide distributed algorithms for convergence to an
equilibrium.

Lastly, we illustrate the potential benefits of this connection on several
cooperative control problems. For the consensus problem, we demonstrate that consensus
can be reached even in an environment with non-convex obstructions. For the
functional consensus problem, we demonstrate an approach that will allow agents to reach
consensus on a specific consensus point which is some function of the initial
conditions. For the dynamic sensor coverage problem, we demonstrate how autonomous
sensors can distribute themselves using only local information in such a way as to
maximize the probability of detecting a particular event over a given mission space.
Lastly, we demonstrate how the popular mathematical game of Sudoku can be
modeled as a non-cooperative game and solved using the learning algorithms discussed in
this dissertation.

1.1  Main  Contributions  of  this  Dissertation 

To  summarize,  we  will  now  restate  the  main  contributions  of  this  dissertation. 

• We introduce the learning algorithm Joint Strategy Fictitious Play with inertia
and establish almost sure convergence to a pure Nash equilibrium in all
generalized ordinal potential games in both cases of averaged or exponentially
discounted historical data.

• We introduce a modification of the traditional no-regret algorithms by (i)
exponentially discounting the memory and (ii) bringing in a notion of inertia in
players’ decision process. We show how these modifications can lead to an
entire class of regret based algorithms that provide almost sure convergence to a
pure Nash equilibrium in any weakly acyclic game.

• We introduce the payoff based algorithm Safe Experimentation and establish
almost sure convergence to an optimal Nash equilibrium in any identical interest
game.

• We introduce the payoff based algorithm Simple Experimentation and establish
almost sure convergence to a pure Nash equilibrium in any weakly acyclic game.

•  We  introduce  the  payoff  based  algorithm  Sample  Experimentation  and  establish 
almost  sure  convergence  to  a  pure  Nash  equilibrium  in  any  weakly  acyclic  game 
even  in  the  presence  of  noisy  utility  functions. 

•  We  derive  an  approach  for  modifying  player  utility  functions  through  tolls  and 
incentives  in  congestion  games  to  guarantee  that  a  centralized  objective  can  be 
realized  as  a  Nash  equilibrium. 

• We establish a connection between potential games and cooperative control and
demonstrate the potential benefits of this connection on several cooperative
control problems including dynamic sensor coverage, consensus, and distributed
routing over a network, as well as the mathematical puzzle Sudoku.

• We derive an equivalent definition for weakly acyclic games that explicitly
highlights the connection between weakly acyclic and potential games.

•  We  introduce  an  extension  of  the  learning  algorithm  Spatial  Adaptive  Play,  called 
binary  Restricted  Spatial  Adaptive  Play,  to  accommodate  restricted  action  sets 
caused  by  agent  limitations.  We  establish  probabilistic  convergence  to  an  action 
profile  that  maximizes  the  potential  function  in  any  potential  game. 

• We introduce an extension of the learning algorithm Spatial Adaptive Play, called
Spatial Adaptive Play with Group Based Decisions, to accommodate group based
collaborations in the decision making process. We establish probabilistic
convergence to an action profile that maximizes the potential function in any potential
game.

• We introduce a new class of games, called sometimes weakly acyclic games, for
time-varying utility functions and action sets, and provide distributed algorithms
for almost sure convergence to a universal Nash equilibrium.

CHAPTER 2


Background 

In  this  section,  we  will  present  a  background  of  the  game  theoretic  concepts  used  in  this 
dissertation.  We  refer  the  readers  to  [FT91,  You98,  You05]  for  a  more  comprehensive 
review. 

2.1  Finite  Strategic-Form  Games 

We consider a finite strategic-form game with $n$-player set $\mathcal{P} := \{\mathcal{P}_1, \ldots, \mathcal{P}_n\}$ where
each player $\mathcal{P}_i \in \mathcal{P}$ has an action set $\mathcal{A}_i$ and a utility function $U_i : \mathcal{A} \to \mathbb{R}$ where
$\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$. We will refer to a finite strategic-form game as just a game and
we will sometimes use a single symbol, e.g., $G$, to represent the entire game, i.e., the
player set, $\mathcal{P}$, action sets, $\mathcal{A}_i$, and utility functions $U_i$.

An example of a two player game is illustrated in matrix form in Figure 2.1. In this
game, each player has two actions or strategies and a utility function represented by
the payoff matrix. Once each player has selected his action, both players receive their
associated reward. For example, if player 1 chooses Top and player 2 chooses Down,
player 1 would receive a reward of 2 while player 2 would receive a reward of 1.

For an action profile $a = (a_1, a_2, \ldots, a_n) \in \mathcal{A}$, let $a_{-i}$ denote the profile of player
actions other than player $\mathcal{P}_i$, i.e.,

$$a_{-i} = \{a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n\}.$$

                           Player 2         Player 2
                           chooses Up       chooses Down

Player 1 chooses Top       0, 0             2, 1

Player 1 chooses Bottom    1, 2             0, 0

                           Payoff Matrix

Figure 2.1: Example of a Finite Strategic-Form Game


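With this notation, we will sometimes write a profile $a$ of actions as $(a_i, a_{-i})$.
Similarly, we may write $U_i(a)$ as $U_i(a_i, a_{-i})$. Furthermore, let $\mathcal{A}_{-i} = \prod_{\mathcal{P}_j \neq \mathcal{P}_i} \mathcal{A}_j$
denote the set of possible collective actions of all players other than player $\mathcal{P}_i$ and let
$\mathcal{P}_{-i} = \{\mathcal{P}_1, \ldots, \mathcal{P}_{i-1}, \mathcal{P}_{i+1}, \ldots, \mathcal{P}_n\}$ denote the set of players other than player $\mathcal{P}_i$.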


2.2  Forms  of  Equilibrium 

In  this  section  we  will  introduce  three  forms  of  equilibrium  that  will  be  discussed  in 
this  dissertation:  Nash  equilibrium,  correlated  equilibrium  (CE),  and  coarse  correlated 
equilibrium  (CCE). 

2.2.1  Nash  Equilibrium 

The  most  well  known  form  of  an  equilibrium  is  the  Nash  equilibrium. 

Definition 2.2.1 (Pure Nash Equilibrium). An action profile $a^* \in \mathcal{A}$ is called a pure
Nash equilibrium if for all players $\mathcal{P}_i \in \mathcal{P}$,

$$U_i(a_i^*, a_{-i}^*) = \max_{a_i \in \mathcal{A}_i} U_i(a_i, a_{-i}^*). \tag{2.1}$$

Furthermore, if the above condition is satisfied with a unique maximizer for every
player $\mathcal{P}_i \in \mathcal{P}$, then $a^*$ is called a strict Nash equilibrium.

A  Nash  equilibrium  represents  a  scenario  for  which  no  player  has  an  incentive  to 
unilaterally  deviate. 
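
Condition (2.1) can be checked by brute force on a small game. The following is a minimal sketch that uses the payoffs of Figure 2.1; the function names (`utility`, `deviate`, `is_pure_nash`, `pure_nash_equilibria`) are illustrative and do not come from the dissertation.

```python
from itertools import product

# Payoffs of the two-player game in Figure 2.1, keyed by the joint action
# (player 1's action, player 2's action). Each value is (U_1, U_2).
ACTIONS = [["Top", "Bottom"], ["Up", "Down"]]
PAYOFFS = {
    ("Top", "Up"): (0, 0),
    ("Top", "Down"): (2, 1),
    ("Bottom", "Up"): (1, 2),
    ("Bottom", "Down"): (0, 0),
}


def utility(i, profile):
    """Utility U_i(a) of player i (0-based index) at the joint action."""
    return PAYOFFS[profile][i]


def deviate(profile, i, action):
    """Return (action, a_{-i}): the profile with player i's entry replaced."""
    return profile[:i] + (action,) + profile[i + 1:]


def is_pure_nash(profile):
    """Check (2.1): each player's utility equals his best-reply utility."""
    return all(
        utility(i, profile)
        == max(utility(i, deviate(profile, i, b)) for b in ACTIONS[i])
        for i in range(len(ACTIONS))
    )


def pure_nash_equilibria():
    """Enumerate every joint action and keep those satisfying (2.1)."""
    return [a for a in product(*ACTIONS) if is_pure_nash(a)]


print(pure_nash_equilibria())
```

For this game the enumeration returns the two off-diagonal profiles, (Top, Down) and (Bottom, Up). Exhaustive enumeration is only practical for small action spaces, since the number of joint actions grows exponentially with the number of players.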

The concept of Nash equilibrium also extends to mixed strategy spaces. Let the
strategy of player $\mathcal{P}_i$ be defined as $p_i \in \Delta(\mathcal{A}_i)$, where $\Delta(\mathcal{A}_i)$ is the set of probability
distributions over the finite set of actions $\mathcal{A}_i$. We will adopt the convention that $p_i^{a_i}$
represents the probability that player $\mathcal{P}_i$ will select action $a_i$ and $\sum_{a_i \in \mathcal{A}_i} p_i^{a_i} = 1$. If all
players $\mathcal{P}_i \in \mathcal{P}$ play independently according to their personal strategy $p_i \in \Delta(\mathcal{A}_i)$,
then the expected utility of player $\mathcal{P}_i$ for strategy $p_i$ is defined as

$$U_i(p_i, p_{-i}) = \sum_{a \in \mathcal{A}} U_i(a)\, p_1^{a_1} p_2^{a_2} \cdots p_n^{a_n},$$

where $p_{-i} = \{p_1, \ldots, p_{i-1}, p_{i+1}, \ldots, p_n\}$ denotes the collection of strategies of players
other than player $\mathcal{P}_i$.

Definition 2.2.2 (Nash Equilibrium). A strategy profile $p^* = \{p_1^*, \ldots, p_n^*\}$ is called a
Nash equilibrium if for all players $\mathcal{P}_i \in \mathcal{P}$,

$$U_i(p_i^*, p_{-i}^*) = \max_{p_i \in \Delta(\mathcal{A}_i)} U_i(p_i, p_{-i}^*). \tag{2.2}$$
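
The expected utility and condition (2.2) can be evaluated directly for a small game. The sketch below again uses the Figure 2.1 payoffs. The candidate strategy profile (each player puts probability 2/3 on his first action) was worked out for this illustration and is not stated in the dissertation, and the function names are hypothetical. Because the expected utility is linear in $p_i$, its maximum over $\Delta(\mathcal{A}_i)$ is attained at a pure strategy, so only pure deviations are compared.

```python
from fractions import Fraction
from itertools import product

ACTIONS = [["Top", "Bottom"], ["Up", "Down"]]
PAYOFFS = {
    ("Top", "Up"): (0, 0),
    ("Top", "Down"): (2, 1),
    ("Bottom", "Up"): (1, 2),
    ("Bottom", "Down"): (0, 0),
}


def expected_utility(i, strategies):
    """U_i(p_i, p_{-i}): sum over joint actions a of U_i(a) * prod_j p_j^{a_j}."""
    total = Fraction(0)
    for profile in product(*ACTIONS):
        prob = Fraction(1)
        for j, action in enumerate(profile):
            prob *= strategies[j][action]
        total += PAYOFFS[profile][i] * prob
    return total


def is_mixed_nash(strategies):
    """Check (2.2) by comparing against every pure deviation of each player."""
    for i in range(len(ACTIONS)):
        current = expected_utility(i, strategies)
        for b in ACTIONS[i]:
            # Pure strategy that plays action b with probability one.
            pure = {a: Fraction(int(a == b)) for a in ACTIONS[i]}
            deviation = strategies[:i] + [pure] + strategies[i + 1:]
            if expected_utility(i, deviation) > current:
                return False
    return True


# Illustrative candidate: each player is indifferent between his two actions.
candidate = [
    {"Top": Fraction(2, 3), "Bottom": Fraction(1, 3)},
    {"Up": Fraction(2, 3), "Down": Fraction(1, 3)},
]
uniform = [
    {"Top": Fraction(1, 2), "Bottom": Fraction(1, 2)},
    {"Up": Fraction(1, 2), "Down": Fraction(1, 2)},
]

print(expected_utility(0, candidate), is_mixed_nash(candidate))
print(is_mixed_nash(uniform))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the indifference conditions hold exactly rather than up to floating point error.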

2.2.2  Correlated  Equilibrium 

In this section we will define a broader class of equilibria for which there may be
correlations among the players. To that end, let $z \in \Delta(\mathcal{A})$ denote a probability distribution
over the set of joint actions $\mathcal{A}$. We will adopt the convention that $z^a$ is the probability
of the joint action $a$ and $\sum_{a \in \mathcal{A}} z^a = 1$. In the special case that all players $\mathcal{P}_i \in \mathcal{P}$ play
independently according to their personal strategy $p_i \in \Delta(\mathcal{A}_i)$, as was the case in the
definition of the Nash equilibrium, then

$$z^a = p_1^{a_1} p_2^{a_2} \cdots p_n^{a_n},$$

where $a = (a_1, a_2, \ldots, a_n)$.

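Definition 2.2.3 (Correlated Equilibrium). The probability distribution $z$ is a
correlated equilibrium if for all players $\mathcal{P}_i \in \mathcal{P}$ and for all actions $a_i, a_i' \in \mathcal{A}_i$,

$$\sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i, a_{-i})\, z^{(a_i, a_{-i})} \geq \sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i', a_{-i})\, z^{(a_i, a_{-i})}. \tag{2.3}$$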

To motivate this definition consider the following scenario. First, a joint action
$a \in \mathcal{A}$ is randomly drawn according to the probability distribution $z \in \Delta(\mathcal{A})$. Next,
each player is informed of only his particular action $a_i$, but not the actions of the other
players. Finally, each player is given the opportunity to change his action. The
condition for correlated equilibrium in (2.3) states that each player $\mathcal{P}_i$'s conditional expected
payoff for action $a_i$ is at least as good as his conditional expected payoff for any other
action $a_i' \neq a_i$. In other words, a probability distribution $z$ is a correlated equilibrium
if and only if no player would seek to change his action from the outcome, randomly
drawn according to $z$, even after his part has been revealed.

Notice  that  all  Nash  equilibria  are  in  fact  correlated  equilibria. 
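
Condition (2.3) is a finite set of linear inequalities in $z$, so it can be checked mechanically. The sketch below does this for the Figure 2.1 game. The two test distributions (`coin_flip`, which mixes evenly between the two pure Nash equilibria, and `uniform`) and all helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

ACTIONS = [["Top", "Bottom"], ["Up", "Down"]]
PAYOFFS = {
    ("Top", "Up"): (0, 0),
    ("Top", "Down"): (2, 1),
    ("Bottom", "Up"): (1, 2),
    ("Bottom", "Down"): (0, 0),
}


def others_profiles(i):
    """All a_{-i}: joint actions of every player except player i."""
    others = [ACTIONS[j] for j in range(len(ACTIONS)) if j != i]
    return list(product(*others))


def join(i, own, others):
    """Rebuild the full profile (a_i, a_{-i}) from its two parts."""
    return others[:i] + (own,) + others[i:]


def is_correlated_equilibrium(z):
    """Check (2.3) for every player and every pair (a_i, a_i')."""
    for i in range(len(ACTIONS)):
        for a_i, a_dev in product(ACTIONS[i], repeat=2):
            followed = Fraction(0)   # left-hand side of (2.3)
            deviated = Fraction(0)   # right-hand side of (2.3)
            for others in others_profiles(i):
                weight = z.get(join(i, a_i, others), Fraction(0))
                followed += PAYOFFS[join(i, a_i, others)][i] * weight
                deviated += PAYOFFS[join(i, a_dev, others)][i] * weight
            if followed < deviated:
                return False
    return True


half = Fraction(1, 2)
coin_flip = {("Top", "Down"): half, ("Bottom", "Up"): half}
uniform = {a: Fraction(1, 4) for a in product(*ACTIONS)}

print(is_correlated_equilibrium(coin_flip))
print(is_correlated_equilibrium(uniform))
```

The `coin_flip` distribution satisfies (2.3) even though it is not a product of independent strategies, whereas `uniform` fails: a player told to play Bottom would rather switch to Top.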

2.2.3  Coarse  Correlated  Equilibrium 

We will now relax the requirements on correlated equilibrium. Before doing so, we
will discuss marginal distributions. Given the joint distribution $z \in \Delta(\mathcal{A})$, the marginal
distribution of all players other than player $\mathcal{P}_i$ is

$$z_{-i}^{a_{-i}} = \sum_{a_i' \in \mathcal{A}_i} z^{(a_i', a_{-i})}.$$

Note that $z_{-i}$ is a well defined probability distribution in $\Delta(\mathcal{A}_{-i})$.

Definition 2.2.4 (Coarse Correlated Equilibrium). The probability distribution $z$ is
a coarse correlated equilibrium if for all players $\mathcal{P}_i \in \mathcal{P}$ and for all actions $a_i' \in \mathcal{A}_i$,

$$\sum_{a \in \mathcal{A}} U_i(a)\, z^a \geq \sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i', a_{-i})\, z_{-i}^{a_{-i}}. \tag{2.4}$$

To motivate this definition, consider the following scenario which differs slightly
from the correlated equilibrium scenario. Before the joint action $a$ is drawn, each
player $\mathcal{P}_i$ is given the opportunity to opt out, in which case the player can select any
action $a_i \in \mathcal{A}_i$ that he wishes. If the player does not opt out, he commits himself to
playing his part of the action-tuple $a$ randomly drawn according to the distribution $z$.
In words, a distribution $z$ is a coarse correlated equilibrium if under this scenario no
player would choose to opt out given that all other players opt to stay in.


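If the joint distribution $z$ is a correlated equilibrium, then we know that for any
action $a_i' \in \mathcal{A}_i$,

$$\begin{aligned}
\sum_{a_i \in \mathcal{A}_i} \sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i, a_{-i})\, z^{(a_i, a_{-i})}
&\geq \sum_{a_i \in \mathcal{A}_i} \sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i', a_{-i})\, z^{(a_i, a_{-i})} \\
&= \sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i', a_{-i}) \sum_{a_i \in \mathcal{A}_i} z^{(a_i, a_{-i})} \\
&= \sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i', a_{-i})\, z_{-i}^{a_{-i}}.
\end{aligned}$$

This implies that for any action $a_i' \in \mathcal{A}_i$,

$$\sum_{a \in \mathcal{A}} U_i(a)\, z^a \geq \sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i', a_{-i})\, z_{-i}^{a_{-i}}.$$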

Therefore, all correlated equilibria, and hence Nash equilibria, are in fact coarse
correlated equilibria as illustrated in Figure 2.2. When all players select
their actions independently, as was the case in the definition of the Nash equilibrium,
the definitions of correlated, coarse correlated, and Nash equilibria are all
equivalent.

Figure 2.2: Relationship Between Nash, Correlated, and Coarse Correlated Equilibria.
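
The marginal distribution $z_{-i}$ and condition (2.4) can be computed in a few lines. The following sketch uses the Figure 2.1 payoffs; the test distributions and the function names (`marginal_of_others`, `is_coarse_correlated_equilibrium`) are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product

ACTIONS = [["Top", "Bottom"], ["Up", "Down"]]
PAYOFFS = {
    ("Top", "Up"): (0, 0),
    ("Top", "Down"): (2, 1),
    ("Bottom", "Up"): (1, 2),
    ("Bottom", "Down"): (0, 0),
}


def marginal_of_others(z, i):
    """z_{-i}: sum z over player i's own action, leaving a law on A_{-i}."""
    marginal = {}
    for profile, prob in z.items():
        others = profile[:i] + profile[i + 1:]
        marginal[others] = marginal.get(others, Fraction(0)) + prob
    return marginal


def is_coarse_correlated_equilibrium(z):
    """Check (2.4): staying in is at least as good as any fixed opt-out action."""
    for i in range(len(ACTIONS)):
        # Left-hand side of (2.4): expected utility from following the draw.
        stay = sum(PAYOFFS[a][i] * prob for a, prob in z.items())
        z_others = marginal_of_others(z, i)
        for a_dev in ACTIONS[i]:
            # Right-hand side of (2.4): commit to a_dev before the draw.
            opt_out = sum(
                PAYOFFS[others[:i] + (a_dev,) + others[i:]][i] * prob
                for others, prob in z_others.items()
            )
            if stay < opt_out:
                return False
    return True


half = Fraction(1, 2)
coin_flip = {("Top", "Down"): half, ("Bottom", "Up"): half}
uniform = {a: Fraction(1, 4) for a in product(*ACTIONS)}

print(marginal_of_others(coin_flip, 0))
print(is_coarse_correlated_equilibrium(coin_flip))
print(is_coarse_correlated_equilibrium(uniform))
```

Unlike the correlated equilibrium test, the opt-out comparison here uses only the marginal $z_{-i}$, because the player decides before learning his part of the draw.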

2.2.4  Equilibrium  Comparison 

The main difference between Nash, correlated, and coarse correlated equilibria is
whether a player is committed conditionally or unconditionally to a random draw of
a given joint distribution $z \in \Delta(\mathcal{A})$. Table 2.1, taken from [You05], summarizes the
main differences between the three forms of equilibria.

                            Conditional Participation   Unconditional Participation
Independent Probabilities   Nash                        Nash
Correlated Probabilities    Correlated                  Coarse Correlated

Table 2.1: Relationship Between Nash, Correlated, and Coarse Correlated Equilibria.


We  will  now  present  a  simple  two  player  example,  from  [You05],  to  highlight 
the  differences  between  the  set  of  Nash  equilibria  and  the  set  of  correlated  or  coarse 
correlated  equilibria.  Note  that  the  set  of  correlated  equilibria  and  the  set  of  coarse 
correlated  equilibria  are  equivalent  in  two  player  games. 

Consider the following two player game with payoff matrix as illustrated in
Figure 2.3. For any joint action, the first entry is the payoff for player 1 and the second
entry is the payoff for player 2. For example, $U_1(L, L) = 1$ and $U_2(L, L) = 1$.
Let $z = \{z^{LL}, z^{LR}, z^{RL}, z^{RR}\}$ be a probability distribution over the joint action space
$\mathcal{A} = \{LL, LR, RL, RR\}$.

Figure 2.3: Example of an Identical Interest Game

In this example, there are two strict Nash equilibria, $(L, L)$ and $(R, R)$.
Furthermore, there is one mixed Nash equilibrium, $p_1^L = p_1^R = 1/2$ and $p_2^L = p_2^R = 1/2$. A
joint distribution $z$ is a correlated equilibrium if and only if the off-diagonal
probabilities do not exceed the diagonal probabilities, i.e.,

$$\max\{z^{LR}, z^{RL}\} \leq \min\{z^{LL}, z^{RR}\}.$$

Therefore,  the  set  of  correlated  equilibria  is  significantly  larger  than  the  set  of  Nash 
equilibria. 
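
The diagonal condition can be recovered by writing out (2.3) for each player and each pair of actions. The derivation below is a sketch under an assumption: the payoff matrix of Figure 2.3 is not reproduced in this text, so we take $U_i(L,L) = U_i(R,R) = 1$ and $U_i(L,R) = U_i(R,L) = 0$ for both players, which is consistent with the stated value $U_1(L,L) = U_2(L,L) = 1$ and with the equilibria listed above.

```latex
% Assumed payoffs (Figure 2.3 is not reproduced here):
%   U_i(L,L) = U_i(R,R) = 1,  U_i(L,R) = U_i(R,L) = 0,  i = 1, 2.
\begin{align*}
\text{Player 1, } a_1 = L,\ a_1' = R:\quad
  & 1 \cdot z^{LL} + 0 \cdot z^{LR} \geq 0 \cdot z^{LL} + 1 \cdot z^{LR}
  && \Longleftrightarrow\ z^{LL} \geq z^{LR}, \\
\text{Player 1, } a_1 = R,\ a_1' = L:\quad
  & 0 \cdot z^{RL} + 1 \cdot z^{RR} \geq 1 \cdot z^{RL} + 0 \cdot z^{RR}
  && \Longleftrightarrow\ z^{RR} \geq z^{RL}, \\
\text{Player 2, } a_2 = L,\ a_2' = R:\quad
  & 1 \cdot z^{LL} + 0 \cdot z^{RL} \geq 0 \cdot z^{LL} + 1 \cdot z^{RL}
  && \Longleftrightarrow\ z^{LL} \geq z^{RL}, \\
\text{Player 2, } a_2 = R,\ a_2' = L:\quad
  & 0 \cdot z^{LR} + 1 \cdot z^{RR} \geq 1 \cdot z^{LR} + 0 \cdot z^{RR}
  && \Longleftrightarrow\ z^{RR} \geq z^{LR}.
\end{align*}
% The four inequalities together are equivalent to
%   \max\{z^{LR}, z^{RL}\} \leq \min\{z^{LL}, z^{RR}\}.
```

Under these assumed payoffs, the four inequalities say that each off-diagonal probability is bounded by both diagonal probabilities, which is exactly the stated condition.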

2.3  Classes  of  Games 

In this dissertation we will consider four classes of games: identical interest games,
potential games, congestion games, and weakly acyclic games. Each class of games
imposes a restriction on the admissible utility functions.

2.3.1 Identical Interest Games


The most restrictive class of games that we will review in this dissertation is identical
interest games. In such a game, the players’ utility functions $\{U_i\}_{i=1}^n$ are chosen to be
the same. That is, for some function $\phi : \mathcal{A} \to \mathbb{R}$,

$$U_i(a) = \phi(a),$$

for every $\mathcal{P}_i \in \mathcal{P}$ and for every $a \in \mathcal{A}$. It is easy to verify that all identical
interest games have at least one pure Nash equilibrium, namely any action profile $a$ that
maximizes $\phi(a)$. An example of an identical interest game is illustrated in Figure 2.3.

2.3.2  Potential  Games 

A significant generalization of an identical interest game is a potential game. In a
potential game, the change in a player’s utility that results from a unilateral change
in strategy equals the change in the global utility. Specifically, there is a function
$\phi : \mathcal{A} \to \mathbb{R}$ such that for every player $\mathcal{P}_i \in \mathcal{P}$, for every $a_{-i} \in \mathcal{A}_{-i}$, and for every
$a_i', a_i'' \in \mathcal{A}_i$,

$$U_i(a_i', a_{-i}) - U_i(a_i'', a_{-i}) = \phi(a_i', a_{-i}) - \phi(a_i'', a_{-i}). \tag{2.5}$$

When  this  condition  is  satisfied,  the  game  is  called  a  potential  game  with  the  potential 
function  0.  It  is  easy  to  see  that  in  potential  games,  any  action  profile  maximizing  the 
potential  function  is  a  pure  Nash  equilibrium,  hence  every  potential  game  possesses  at 
least  one  such  equilibrium. 
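
Both claims, that a candidate function satisfies (2.5) and that its maximizer is a pure Nash equilibrium, can be verified by enumeration. The sketch below uses a hypothetical two-player game with actions `C` and `D` and a hand-constructed potential; neither the payoffs nor the potential values come from the dissertation (the game of Figure 2.4 is not reproduced in this text).

```python
from itertools import product

# Hypothetical symmetric two-player game; values are (U_1, U_2).
ACTIONS = [["C", "D"], ["C", "D"]]
UTILITIES = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
# Hand-constructed candidate potential function phi for the game above.
POTENTIAL = {("C", "C"): 0, ("C", "D"): 2, ("D", "C"): 2, ("D", "D"): 3}


def deviate(profile, i, action):
    """Return the profile with player i's action replaced by `action`."""
    return profile[:i] + (action,) + profile[i + 1:]


def is_exact_potential(utilities, potential):
    """Check (2.5) for every profile, player, and unilateral deviation."""
    for profile in product(*ACTIONS):
        for i in range(len(ACTIONS)):
            for b in ACTIONS[i]:
                other = deviate(profile, i, b)
                du = utilities[profile][i] - utilities[other][i]
                dphi = potential[profile] - potential[other]
                if du != dphi:
                    return False
    return True


def is_pure_nash(utilities, profile):
    """No player can strictly gain by a unilateral deviation."""
    return all(
        utilities[profile][i] >= utilities[deviate(profile, i, b)][i]
        for i in range(len(ACTIONS))
        for b in ACTIONS[i]
    )


maximizer = max(POTENTIAL, key=POTENTIAL.get)
print(is_exact_potential(UTILITIES, POTENTIAL))
print(maximizer, is_pure_nash(UTILITIES, maximizer))
```

In this example the potential maximizer is (D, D), which is a pure Nash equilibrium but not the profile with the highest payoffs, a reminder that maximizing the potential function is not the same as maximizing the players' utilities.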

An example of a two player potential game with associated potential function is
illustrated in Figure 2.4.

Figure 2.4: Example of a Potential Game with Potential Function

We will also consider a more general class of potential games known as generalized
ordinal potential games. In generalized ordinal potential games there is a function
$\phi : \mathcal{A} \to \mathbb{R}$ such that for every player $\mathcal{P}_i \in \mathcal{P}$, for every $a_{-i} \in \mathcal{A}_{-i}$, and for every
$a_i', a_i'' \in \mathcal{A}_i$,

$$U_i(a_i', a_{-i}) - U_i(a_i'', a_{-i}) > 0 \;\Rightarrow\; \phi(a_i', a_{-i}) - \phi(a_i'', a_{-i}) > 0.$$

2.3.3  Congestion  Games 

Congestion games are a specific class of games in which player utility functions have
a special structure.

In order to define a congestion game, we must specify the action set, $\mathcal{A}_i$, and
utility function, $U_i(\cdot)$, of each player. Towards this end, let $\mathcal{R}$ denote a finite set of
“resources”. For each resource $r \in \mathcal{R}$, there is an associated “congestion function”

$$c_r : \{0, 1, 2, \ldots\} \to \mathbb{R}$$

that reflects the cost of using the resource as a function of the number of players using
that resource.

The  action  set,  A%,  of  each  player,  V,  ,  is  defined  as  the  set  of  resources  available  to 
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player  V,,  i.e., 


A  C  2n, 

where  2R  denotes  the  set  of  subsets  of  7Z.  Accordingly,  an  action,  a,  G  At,  reflects  a 
selection  of  (multiple)  resources,  a*  C  1Z.  A  player  is  “using”  resource  r  if  r  G  a*.  For 
an  action  profile  a  G  A  let  ay  (a)  denote  the  total  number  of  players  using  resource 
r,  i.e.,  |{i  :  r  G  a*}|.  In  a  congestion  game,  the  utility  of  player  Vi  using  resources 
indicated  by  a,:  depends  only  on  the  total  number  of  players  using  the  same  resources. 
More  precisely,  the  utility  of  player  Vi  is  defined  as 

uiia)  =  y ^cr(q>(a)).  (2.6) 

r£<n 

Any  congestion  game  with  utility  functions  as  in  (2.6)  is  a  potential  game  [Ros73]  with 
potential  function 

ar  (a) 

0(a)  =  Cr (fc)-  ^2-7) 

fc=l 

In  fact,  every  congestion  game  is  a  potential  game  and  every  finite  potential  game  is 
isomorphic  to  a  congestion  game  [MS96b]. 
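Equations (2.6) and (2.7) translate directly into code. The sketch below is illustrative only: the two-road instance, the congestion functions in `cost`, and all function names are assumptions introduced here, not taken from the text. It evaluates the utilities and the potential, and checks the potential game property by enumeration.

```python
from itertools import product


def usage(profile, r):
    """sigma_r(a): number of players whose chosen resource set contains r."""
    return sum(1 for a_i in profile if r in a_i)


def utility(i, profile, cost):
    """Equation (2.6): sum of c_r(sigma_r(a)) over the resources player i uses."""
    return sum(cost[r](usage(profile, r)) for r in profile[i])


def potential(profile, cost):
    """Equation (2.7): sum over resources of c_r(1) + ... + c_r(sigma_r(a))."""
    return sum(
        sum(cost[r](k) for k in range(1, usage(profile, r) + 1))
        for r in cost
    )


def potential_property_holds(action_sets, cost):
    """Check that unilateral utility changes equal potential changes."""
    for a in product(*action_sets):
        for i, acts in enumerate(action_sets):
            for alt in acts:
                b = a[:i] + (alt,) + a[i + 1:]
                du = utility(i, a, cost) - utility(i, b, cost)
                dphi = potential(a, cost) - potential(b, cost)
                if du != dphi:
                    return False
    return True


# Hypothetical instance: two players, each choosing one of two roads.
# Congestion values are negative so that more congestion means less utility.
cost = {"r1": lambda k: -k, "r2": lambda k: -2 * k}
R1, R2 = frozenset({"r1"}), frozenset({"r2"})
action_sets = [[R1, R2], [R1, R2]]
```

Note that each player only needs the counts $\sigma_r(a)$ on the resources he uses, not the identities of the other players; this is the aggregation that later makes congestion games a natural fit for JSFP.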

2.3.4  Weakly  Acyclic  Games 

Consider any finite game $G$ with a set $A$ of action profiles. A better reply path is a sequence of action profiles $a^1, a^2, \ldots, a^L$ such that, for every $1 \le \ell \le L - 1$, there is exactly one player $\mathcal{P}_{i_\ell}$ such that i) $a^{\ell}_{i_\ell} \neq a^{\ell+1}_{i_\ell}$, ii) $a^{\ell}_{-i_\ell} = a^{\ell+1}_{-i_\ell}$, and iii) $U_{i_\ell}(a^{\ell}) < U_{i_\ell}(a^{\ell+1})$. In other words, one player moves at a time, and each time a player moves he increases his own utility.

Suppose now that $G$ is a potential game with potential function $\phi$. Starting from an arbitrary action profile $a \in A$, construct a better reply path $a = a^1, a^2, \ldots, a^L$ until it can no longer be extended. Note first that such a path cannot cycle back on itself, because $\phi$ is strictly increasing along the path. Since $A$ is finite, the path cannot be extended indefinitely. Hence, the last element in a maximal better reply path from any joint action, $a$, must be a Nash equilibrium of $G$.

This idea may be generalized as follows. The game $G$ is weakly acyclic if for any $a \in A$, there exists a better reply path starting at $a$ and ending at some pure Nash equilibrium of $G$ [You98, You05]. Potential games are special cases of weakly acyclic games.
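The construction of a maximal better reply path can be sketched as follows. This is an illustration under stated assumptions: the identical interest payoff table `PAYOFF` is hypothetical, and the function always takes the first improving deviation it finds. The loop is only guaranteed to terminate when every better reply path is finite, as in a potential game; in a general game it may cycle forever.

```python
def better_reply_path(actions, utilities, start):
    """Extend a better reply path from `start` until no player can improve."""
    path = [start]
    while True:
        a = path[-1]
        step = None
        for i, acts in enumerate(actions):
            for alt in acts:
                b = a[:i] + (alt,) + a[i + 1:]  # only player i changes action
                if utilities[i](b) > utilities[i](a):
                    step = b
                    break
            if step is not None:
                break
        if step is None:
            return path  # last element admits no better reply: a pure Nash equilibrium
        path.append(step)


# Hypothetical two player identical interest game (hence a potential game).
PAYOFF = {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 1}
actions = [(0, 1), (0, 1)]
utilities = [lambda a: PAYOFF[a], lambda a: PAYOFF[a]]

path = better_reply_path(actions, utilities, (0, 1))
```

Starting from the miscoordinated profile `(0, 1)`, the path ends after one move at a pure Nash equilibrium, although not at the potential maximizer `(0, 0)`; a better reply path guarantees an equilibrium, not an optimal one.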

An  example  of  a  two  player  weakly  acyclic  game  is  illustrated  in  Figure  2.5. 


Figure 2.5: Example of a Weakly Acyclic Game. Panels: “Weakly Acyclic Under Better Replies” and “Not Weakly Acyclic Under Better Replies”.


2.4  Repeated  Games 

In a repeated game, at each time $t \in \{0, 1, 2, \ldots\}$, each player $\mathcal{P}_i \in \mathcal{P}$ simultaneously chooses an action $a_i(t) \in A_i$ and receives the utility $U_i(a(t))$, where $a(t) := (a_1(t), \ldots, a_n(t))$. Each player $\mathcal{P}_i \in \mathcal{P}$ chooses his action $a_i(t)$ at time $t$ according to a probability distribution $p_i(t)$, which we will refer to as the strategy of player $\mathcal{P}_i$ at time $t$. A player’s strategy at time $t$ can rely only on observations from times $\{0, 1, 2, \ldots, t - 1\}$. Different learning algorithms are specified by both the assumptions on available information and the mechanism by which the strategies are updated as information is gathered.

We  will  review  three  main  classes  of  learning  algorithms  in  this  dissertation:  full 
information,  virtual  payoff  based,  and  payoff  based.  For  a  detailed  review  of  learning 
in  games  we  direct  the  reader  to  [FL98,  You98,  You05,  HS98,  Wei95,  Sam97]. 

2.4.1  Full  Information  Learning  Algorithms 

The  most  informationally  sophisticated  class  of  learning  algorithms  is  full  information. 
In  full  information  learning  algorithms,  each  player  knows  the  functional  form  of  his 
utility  function  and  is  capable  of  observing  the  actions  of  all  other  players  at  every  time 
step. The strategy adjustment mechanism of player $\mathcal{P}_i$ can be written in the general form

$$p_i(t) = F_i\big(a(0), \ldots, a(t-1);\, U_i\big).$$

In  this  setting,  players  may  develop  probabilistic  models  for  the  actions  of  other 
players using past observations. Based on these models, players may seek to maximize
some  form  of  an  expected  utility.  An  example  of  a  learning  algorithm,  or  strategy 
adjustment  mechanism,  of  this  form  is  the  well  known  fictitious  play  [MS96a].  We 
will  review  fictitious  play  in  Section  3.2.1. 

2.4.2  Virtual  Payoff  Based  Learning  Algorithms 

We will now relax the requirements of full information learning algorithms. In virtual payoff based algorithms, players are now unaware of the structural form of their utility function. Furthermore, players also are not capable of observing the actions of all players. However, players are endowed with the ability to assess the utility that they would have received for alternative action choices. For example, suppose that the action played at time $t$ is $a(t)$. In virtual payoff based dynamics, each player $\mathcal{P}_i$ with action set $A_i = \{a_i^1, \ldots, a_i^{|A_i|}\}$ has access to the following information:

$$a(t) \;\Rightarrow\; \begin{pmatrix} U_i(a_i^1, a_{-i}(t)) \\ \vdots \\ U_i(a_i^{|A_i|}, a_{-i}(t)) \end{pmatrix},$$

where $|A_i|$ denotes the cardinality of the action set $A_i$.

The strategy adjustment mechanism of player $\mathcal{P}_i$ can be written in the general form

$$p_i(t) = F_i\big(\{U_i(a_i, a_{-i}(0))\}_{a_i \in A_i}, \ldots, \{U_i(a_i, a_{-i}(t-1))\}_{a_i \in A_i}\big).$$

An example of a learning algorithm, or strategy adjustment mechanism, of this form is the well known regret matching [HM00]. We will review regret matching in Section 4.2. Virtual payoff based learning algorithms will be the focus of Chapters 3 and 4.

2.4.3  Payoff  Based  Learning  Algorithms 

Payoff based learning algorithms are the most informationally restrictive class of learning algorithms. Now, players only have access to (i) the action they played and (ii) the utility (possibly noisy) they received. In this setting, the strategy adjustment mechanism of player $\mathcal{P}_i$ takes on the form

$$p_i(t) = F_i\big(\{a_i(0), U_i(a(0))\}, \ldots, \{a_i(t-1), U_i(a(t-1))\}\big). \tag{2.8}$$

We will discuss payoff based learning algorithms extensively in Chapter 5.

CHAPTER 3

Joint Strategy Fictitious Play with Inertia for Potential Games

In this chapter we consider multi-player repeated games involving a large number of players with large strategy spaces and enmeshed utility structures. In these “large-scale” games, players are inherently faced with limitations in both their observational and computational capabilities. Accordingly, players in large-scale games need to make their decisions using algorithms that accommodate limitations in information gathering and processing. This disqualifies some of the well known decision making models such as “Fictitious Play” (FP), in which each player must monitor the individual actions of every other player and must optimize over a high dimensional probability space. We will show that Joint Strategy Fictitious Play (JSFP), a close variant of FP, alleviates both the informational and computational burden of FP. Furthermore, we introduce JSFP with inertia, i.e., a probabilistic reluctance to change strategies, and establish the convergence to a pure Nash equilibrium in all generalized ordinal potential games in both cases of averaged or exponentially discounted historical data. We illustrate JSFP with inertia on the specific class of congestion games, a subset of generalized ordinal potential games. In particular, we illustrate the main results on a distributed traffic routing problem and derive tolling procedures that can lead to optimized total traffic congestion.

3.1 Introduction


We  consider  “large-scale”  repeated  games  involving  a  large  number  of  players,  each  of 
whom  selects  a  strategy  from  a  possibly  large  strategy  set.  A  player’s  reward,  or  utility, 
depends  on  the  actions  taken  by  all  players.  The  game  is  repeated  over  multiple  stages, 
and  this  allows  players  to  adapt  their  strategies  in  response  to  the  available  information 
gathered  over  prior  stages.  This  setup  falls  under  the  general  subject  of  “learning 
in  games”  [FL98,  You05],  and  there  are  a  variety  of  algorithms  and  accompanying 
analysis  that  examine  the  long  term  behavior  of  these  algorithms. 

In  large-scale  games  players  are  inherently  faced  with  limitations  in  both  their 
observational  and  computational  capabilities.  Accordingly,  players  in  such  large-scale 
games  need  to  make  their  decisions  using  algorithms  that  accommodate  limitations  in 
information  gathering  and  processing.  This  limits  the  feasibility  of  different  learning 
algorithms.  For  example,  the  well-studied  algorithm  “Fictitious  Play”  (FP)  requires 
individual  players  to  individually  monitor  the  actions  of  other  players  and  to  optimize 
their  strategies  according  to  a  probability  distribution  function  over  the  joint  actions  of 
other  players.  Clearly,  such  information  gathering  and  processing  is  not  feasible  in  a 
large-scale  game. 

The  main  objective  of  this  chapter  is  to  study  a  variant  of  FP  called  Joint  Strategy 
Fictitious  Play  (JSFP)  [FL98,  FK93,  MS97].  We  will  argue  that  JSFP  is  a  plausible 
decision  making  model  for  certain  large-scale  games.  We  will  introduce  a  modification 
of  JSFP  to  include  inertia,  in  which  there  is  a  probabilistic  reluctance  of  any  player  to 
change  strategies.  We  will  establish  that  JSFP  with  inertia  converges  to  a  pure  Nash 
equilibrium  for  a  class  of  games  known  as  generalized  ordinal  potential  games,  which 
includes  so-called  congestion  games  as  a  special  case  [Ros73]. 

Our motivating example for a large-scale congestion game is distributed traffic routing [BL85], in which a large number of vehicles make daily routing decisions to optimize their own objectives in response to their own observations. In this setting, observing and responding to the individual actions of all vehicles on a daily basis would be a formidable task for any individual driver. A more realistic model of the information tracked and processed by an individual driver is the daily aggregate congestion on the roads that are of interest to that driver [BPK91]. It turns out that JSFP accommodates such information aggregation.

We  will  now  review  some  of  the  well  known  decision  making  models  and  discuss 
their  limitations  in  large-scale  games.  See  the  monographs  [FL98,  You98,  You05, 
HS98,  Wei95]  and  survey  article  [Har05]  for  a  more  comprehensive  review. 

The well known FP algorithm requires that each player views all other players as independent decision makers [FL98]. In the FP framework, each player observes the decisions made by all other players and computes the empirical frequencies (i.e., running averages) of these observed decisions. Then, each player best responds to the empirical frequencies of other players’ decisions by first computing the expected utility for each strategy choice under the assumption that the other players will independently make their decisions probabilistically according to the observed empirical frequencies. FP is known to be convergent to a Nash equilibrium in potential games, but need not converge for other classes of games. General convergence issues are discussed in [HM03b, SA05, AS04].

The paper [LES05] introduces a version of FP, called “sampled FP”, that seeks to avoid computing an expected utility based on the empirical frequencies, because for large-scale games, this expected utility computation can be prohibitively demanding. In sampled FP, each player selects samples from the strategy space of every other player according to the empirical frequencies of that player’s past decisions. A player then computes an average utility for each strategy choice based on these samples. Each player still has to observe the decisions made by all other players to compute the empirical frequencies of these observed decisions. Sampled FP is proved to be convergent in identical interest games, but the number of samples needed to guarantee convergence grows unboundedly.

There  are  convergent  learning  algorithms  for  a  large  class  of  coordination  games 
called  “weakly  acyclic”  games  [You98].  In  adaptive  play  [You93]  players  have  finite 
recall  and  respond  to  the  recent  history  of  other  players.  Adaptive  play  requires  each 
player  to  track  the  individual  behavior  of  all  other  players  for  recall  window  lengths 
greater  than  one.  Thus,  as  the  size  of  player  memory  grows,  adaptive  play  suffers  from 
the  same  computational  setback  as  FP. 

It turns out that there is a strong similarity between the JSFP discussed herein and the regret matching algorithm [HM00]. A player’s regret for a particular choice is defined as the difference between 1) the utility that would have been received if that particular choice had been played for all the previous stages and 2) the average utility actually received in the previous stages. A player using the regret matching algorithm updates a regret vector for each possible choice, and selects actions according to a probability proportional to positive regret. In JSFP, a player chooses an action by myopically maximizing the anticipated utility based on past observations, which is effectively equivalent to regret modulo a bias term. A current open question is whether player choices would converge in coordination-type games when all players use the regret matching algorithm (except for the special case of two-player games [HM03a]). There are finite memory versions of the regret matching algorithm and various generalizations [You05], such as playing best or better responses to regret over the last $m$ stages, that are proven to be convergent in weakly acyclic games when players use some sort of inertia. These finite memory algorithms do not require each player to track the behavior of other players individually. Rather, each player needs to remember the utilities actually received and the utilities that could have been received in the last $m$ stages. In contrast, a player using JSFP best responds according to accumulated experience over the entire history by using a simple recursion which can also incorporate exponential discounting of the historical data.

There  are  also  payoff  based  dynamics,  where  each  player  observes  only  the  actual 
utilities  received  and  uses  a  Reinforcement  Learning  (RL)  algorithm  [SB98,  BT96] 
to  make  future  choices.  Convergence  of  player  choices  when  all  players  use  an  RL- 
like  algorithm  is  proved  for  identical  interest  games  [LC03,  LC05b,  LC05a]  assuming 
that  learning  takes  place  at  multiple  time  scales.  Finally,  the  payoff  based  dynamics 
with  finite-memory  presented  in  [HS04]  leads  to  a  Pareto-optimal  outcome  in  generic 
common  interest  games. 

Regarding the distributed routing setting of Section 3.4, there are papers that analyze different routing strategies in congestion games with “infinitesimal” players, i.e., a continuum of players as opposed to a large, but finite, number of players. References [FV04, FV05, FRV06] analyze the convergence properties of a class of routing strategies that is a variation of the replicator dynamics in congestion games, also referred to as symmetric games, under a variety of settings. Reference [BEL06] analyzes the convergence properties of no-regret algorithms in such congestion games and also considers congestion games with discrete players, as considered in this chapter, but the results hold only for a highly structured symmetric game.

The remainder of this chapter is organized as follows. Section 3.2 sets up JSFP and goes on to establish convergence to a pure Nash equilibrium for JSFP with inertia in all generalized ordinal potential games. Section 3.3 presents a fading memory variant of JSFP, and likewise establishes convergence to a pure Nash equilibrium. Section 3.4 presents an illustrative example for traffic congestion games. It goes on to illustrate the use of tolls to achieve a socially optimal equilibrium and derives conditions for this equilibrium to be unique.

3.2  Joint  Strategy  Fictitious  Play  with  Inertia 

Consider a finite game with n-player set $\mathcal{P} := \{\mathcal{P}_1, \ldots, \mathcal{P}_n\}$ where each player $\mathcal{P}_i \in \mathcal{P}$ has an action set $A_i$ and a utility function $U_i : A \to \mathbb{R}$ where $A = A_1 \times \cdots \times A_n$.

In a repeated game as described in Section 2.4, at every stage $t \in \{0, 1, 2, \ldots\}$, each player, $\mathcal{P}_i$, simultaneously selects an action $a_i(t) \in A_i$. This selection is a function of the information available to player $\mathcal{P}_i$ up to stage $t$. Both the action selection function and the available information depend on the underlying learning process.

3.2.1  Fictitious  Play 

We  start  with  the  well  known  Fictitious  Play  (FP)  process  [FL98].  Fictitious  Play  is  an 
example  of  a  full  information  learning  algorithm. 

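Define the empirical frequency, $q_i^{\bar{a}_i}(t)$, as the percentage of stages at which player $\mathcal{P}_i$ has chosen the action $\bar{a}_i \in A_i$ up to time $t - 1$, i.e.,

$$q_i^{\bar{a}_i}(t) := \frac{1}{t} \sum_{\tau=0}^{t-1} I\{a_i(\tau) = \bar{a}_i\},$$

where $a_i(k) \in A_i$ is player $\mathcal{P}_i$’s action at time $k$ and $I\{\cdot\}$ is the indicator function. Now define the empirical frequency vector for player $\mathcal{P}_i$ as

$$q_i(t) := \begin{pmatrix} q_i^{a_i^1}(t) \\ \vdots \\ q_i^{a_i^{|A_i|}}(t) \end{pmatrix},$$

where $|A_i|$ is the cardinality of the action set $A_i = \{a_i^1, \ldots, a_i^{|A_i|}\}$.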

The action of player $\mathcal{P}_i$ at time $t$ is based on the (incorrect) presumption that other players are playing randomly and independently according to their empirical frequencies. Under this presumption, the expected utility for the action $\bar{a}_i \in A_i$ is

$$U_i(\bar{a}_i, q_{-i}(t)) := \sum_{a_{-i} \in A_{-i}} U_i(\bar{a}_i, a_{-i}) \prod_{a_j \in a_{-i}} q_j^{a_j}(t), \tag{3.1}$$

where $q_{-i}(t) := \{q_1(t), \ldots, q_{i-1}(t), q_{i+1}(t), \ldots, q_n(t)\}$ and $A_{-i} := \times_{j \neq i} A_j$. In the FP process, player $\mathcal{P}_i$ uses this expected utility by selecting an action at time $t$ from the set

$$BR_i(q_{-i}(t)) := \Big\{\bar{a}_i \in A_i : U_i(\bar{a}_i, q_{-i}(t)) = \max_{a_i \in A_i} U_i(a_i, q_{-i}(t))\Big\}.$$

The set $BR_i(q_{-i}(t))$ is called player $\mathcal{P}_i$’s best response to $q_{-i}(t)$. In case of a non-unique best response, player $\mathcal{P}_i$ makes a random selection from $BR_i(q_{-i}(t))$.

It  is  known  that  the  empirical  frequencies  generated  by  FP  converge  to  a  Nash 
equilibrium  in  potential  games  [MS96b]. 

Note that FP as described above requires each player to observe the actions made by every other individual player. Moreover, choosing an action based on the predictions (3.1) amounts to enumerating all possible joint actions in $\times_j A_j$ at every stage for each player. Hence, FP is computationally prohibitive as a decision making model in large-scale games.
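To make the cost of (3.1) concrete, here is a minimal sketch of one FP decision. The two player coordination game, the short history, and all function names are hypothetical and serve only as an illustration. Note the explicit loop over every joint action of the opponents, whose size grows exponentially with the number of players.

```python
from itertools import product


def empirical_frequencies(history, j, actions_j):
    """q_j(t): fraction of past stages in which player j chose each action."""
    t = len(history)
    return {x: sum(1 for a in history if a[j] == x) / t for x in actions_j}


def fp_expected_utility(i, a_i, q, actions, U):
    """Equation (3.1): expectation under independent empirical frequencies."""
    others = [j for j in range(len(actions)) if j != i]
    total = 0.0
    for a_minus in product(*(actions[j] for j in others)):  # enumerate A_{-i}
        prob = 1.0
        for j, a_j in zip(others, a_minus):
            prob *= q[j][a_j]
        profile = list(a_minus)
        profile.insert(i, a_i)  # put player i's action back in position i
        total += prob * U(i, tuple(profile))
    return total


def fp_best_response(i, history, actions, U):
    """BR_i(q_{-i}(t)): the set of actions maximizing the expected utility."""
    q = [empirical_frequencies(history, j, actions[j]) for j in range(len(actions))]
    values = {x: fp_expected_utility(i, x, q, actions, U) for x in actions[i]}
    best = max(values.values())
    return {x for x, v in values.items() if v == best}


# Hypothetical example: both players receive 1 if they coordinate, else 0.
example_actions = [("a", "b"), ("a", "b")]
example_history = [("a", "a"), ("a", "b"), ("a", "a")]


def coordination_utility(i, profile):
    return 1 if profile[0] == profile[1] else 0
```

A tie-breaking rule is deliberately left out; the function returns the whole best response set, from which the text prescribes a random selection.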

3.2.2  Setup:  Joint  Strategy  Fictitious  Play 

In JSFP, each player tracks the empirical frequencies of the joint actions of all other players. In contrast to FP, the action of player $\mathcal{P}_i$ at time $t$ is based on the (still incorrect) presumption that other players are playing randomly but jointly according to their joint empirical frequencies, i.e., each player views all other players as a collective group.

Let $z^{\bar{a}}(t)$ be the percentage of stages at which all players chose the joint action profile $\bar{a} \in A$ up to time $t - 1$, i.e.,

$$z^{\bar{a}}(t) := \frac{1}{t} \sum_{\tau=0}^{t-1} I\{a(\tau) = \bar{a}\}. \tag{3.2}$$

Let $z(t)$ denote the empirical frequency vector formed by the components $\{z^{\bar{a}}(t)\}_{\bar{a} \in A}$. Note that the dimension of $z(t)$ is the cardinality $|A|$.

Similarly, let $z_{-i}^{\bar{a}_{-i}}(t)$ be the percentage of stages at which players other than player $\mathcal{P}_i$ have chosen the joint action profile $\bar{a}_{-i} \in A_{-i}$ up to time $t - 1$, i.e.,

$$z_{-i}^{\bar{a}_{-i}}(t) := \frac{1}{t} \sum_{\tau=0}^{t-1} I\{a_{-i}(\tau) = \bar{a}_{-i}\}, \tag{3.3}$$

which, given $z(t)$, can also be expressed as

$$z_{-i}^{\bar{a}_{-i}}(t) = \sum_{\bar{a}_i \in A_i} z^{(\bar{a}_i, \bar{a}_{-i})}(t).$$

Let $z_{-i}(t)$ denote the empirical frequency vector formed by the components $\{z_{-i}^{\bar{a}_{-i}}(t)\}_{\bar{a}_{-i} \in A_{-i}}$. Note that the dimension of $z_{-i}(t)$ is the cardinality $|\times_{j \neq i} A_j|$.

Similarly to FP, player $\mathcal{P}_i$’s action at time $t$ is based on an expected utility for the action $\bar{a}_i \in A_i$, but now based on the joint action model of opponents given by¹

$$U_i(\bar{a}_i, z_{-i}(t)) := \sum_{a_{-i} \in A_{-i}} U_i(\bar{a}_i, a_{-i})\, z_{-i}^{a_{-i}}(t). \tag{3.4}$$

In the JSFP process, player $\mathcal{P}_i$ uses this expected utility by selecting an action at time $t$ from the set

$$BR_i(z_{-i}(t)) := \Big\{\bar{a}_i \in A_i : U_i(\bar{a}_i, z_{-i}(t)) = \max_{a_i \in A_i} U_i(a_i, z_{-i}(t))\Big\}.$$

Note that the utility as expressed in (3.4) is linear in $z_{-i}(t)$.

When written in this form, JSFP appears to have a computational burden for each player that is even higher than that of FP, since tracking the empirical frequencies $z_{-i}(t) \in \Delta(A_{-i})$ of the joint actions of the other players is more demanding for player $\mathcal{P}_i$ than tracking the empirical frequencies $q_{-i}(t) \in \times_{j \neq i} \Delta(A_j)$ of the actions of the other players individually, where $\Delta(A)$ denotes the set of probability distributions on a finite set $A$. However, it is possible to rewrite JSFP to significantly reduce the computational burden on each player.

¹ Note that we use the same notation for the related quantities $U_i(a_i, a_{-i})$, $U_i(a_i, q_{-i})$, and $U_i(a_i, z_{-i})$, where the latter two are derived from the first as defined in equations (3.1) and (3.4), respectively.

To choose an action at any time, $t$, player $\mathcal{P}_i$ using JSFP needs only the predicted utilities $U_i(\bar{a}_i, z_{-i}(t))$ for each $\bar{a}_i \in A_i$. Substituting (3.3) into (3.4) results in

$$U_i(\bar{a}_i, z_{-i}(t)) = \frac{1}{t} \sum_{\tau=0}^{t-1} U_i(\bar{a}_i, a_{-i}(\tau)),$$

which is the average utility player $\mathcal{P}_i$ would have received if action $\bar{a}_i$ had been chosen at every stage up to time $t - 1$ and other players used the same actions. This average utility, denoted by $V_i^{\bar{a}_i}(t)$, admits the following simple recursion,

$$V_i^{\bar{a}_i}(t + 1) = \frac{t}{t + 1} V_i^{\bar{a}_i}(t) + \frac{1}{t + 1} U_i(\bar{a}_i, a_{-i}(t)).$$

The important implication is that JSFP dynamics can be implemented without requiring each player to track the empirical frequencies of the joint actions of the other players and without requiring each player to compute an expectation over the space of the joint actions of all other players. Rather, each player using JSFP merely updates the predicted utilities for each available action using the recursion above, and chooses an action each stage with maximal predicted utility.
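The recursion can be sketched in a few lines. The example below is illustrative only: the three player utility `U` (the negative of the number of players choosing the same action) and the short `history` are hypothetical, and exact rational arithmetic is used so that the recursion can be compared with the direct average term by term.

```python
from fractions import Fraction


def jsfp_update(V, t, i, a_t, actions_i, U):
    """V(t+1) = t/(t+1) * V(t) + 1/(t+1) * U_i(a_i, a_{-i}(t)) for each a_i."""
    new_V = {}
    for x in actions_i:
        hypothetical = a_t[:i] + (x,) + a_t[i + 1:]  # replay stage t with a_i = x
        new_V[x] = (t * V[x] + U(i, hypothetical)) / (t + 1)
    return new_V


def U(i, profile):
    """Hypothetical congestion-like utility: minus the number of players on my action."""
    return -sum(1 for x in profile if x == profile[i])


actions_i = ("x", "y")
history = [("x", "x", "y"), ("y", "x", "y"), ("x", "y", "y")]

# Player 0 keeps one number per own action; V(0) has weight zero in the first update.
V = {x: Fraction(0) for x in actions_i}
for t, a_t in enumerate(history):
    V = jsfp_update(V, t, 0, a_t, actions_i, U)

# Direct evaluation of the averaged form of (3.4), for comparison.
direct = {
    x: Fraction(sum(U(0, (x,) + a[1:]) for a in history), len(history))
    for x in actions_i
}
jsfp_choice = max(V, key=V.get)
```

The memory required by player 0 is one number per own action, independent of the number of opponents, which is the point made in the paragraph above.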

An interesting feature of JSFP is that each strict Nash equilibrium has an “absorption” property as summarized in Proposition 3.2.1.

Proposition 3.2.1. In any finite n-person game, if at any time $t > 0$, the joint action $a(t)$ generated by a JSFP process is a strict Nash equilibrium, then $a(t + \tau) = a(t)$ for all $\tau > 0$.

Proof. For each player $\mathcal{P}_i \in \mathcal{P}$ and for all actions $a_i \in A_i$,

$$U_i(a_i(t), z_{-i}(t)) \ge U_i(a_i, z_{-i}(t)).$$

Since $a(t)$ is a strict Nash equilibrium, we know that for all actions $a_i \in A_i \setminus a_i(t)$,

$$U_i(a_i(t), a_{-i}(t)) > U_i(a_i, a_{-i}(t)).$$

By writing $z_{-i}(t + 1)$ in terms of $z_{-i}(t)$ and $a_{-i}(t)$,

$$U_i(a_i(t), z_{-i}(t + 1)) = \frac{t}{t + 1} U_i(a_i(t), z_{-i}(t)) + \frac{1}{t + 1} U_i(a_i(t), a_{-i}(t)).$$

Therefore, $a_i(t)$ is the only best response to $z_{-i}(t + 1)$,

$$U_i(a_i(t), z_{-i}(t + 1)) > U_i(a_i, z_{-i}(t + 1)), \quad \forall a_i \in A_i \setminus a_i(t).$$

□

A strict Nash equilibrium need not possess this absorption property in general for standard FP when there are more than two players.²
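The three player identical interest game given in the footnote to this paragraph can be replayed numerically. The sketch below is an illustration under the footnote's stated payoffs and history; the function names are assumptions. It computes the next action of each player under FP, using (3.1), and under JSFP, using the averaged form of (3.4).

```python
from itertools import product

ACTIONS = ("a", "b")

# Identical interest utility U(a_1, a_2, a_3) from the footnote example.
U = {
    ("a", "b", "a"): 1, ("b", "a", "a"): 1,
    ("a", "a", "a"): 0, ("b", "b", "a"): 0,
    ("a", "a", "b"): 1, ("b", "b", "b"): 1,
    ("a", "b", "b"): -1, ("b", "a", "b"): -100,
}

# First two stages: (a, a, a), then the strict Nash equilibrium (b, b, b).
history = [("a", "a", "a"), ("b", "b", "b")]


def fp_value(i, x, history):
    """Eq. (3.1): opponents treated as independent, each with his own frequencies."""
    t = len(history)
    total = 0.0
    for prof in product(ACTIONS, repeat=3):
        if prof[i] != x:
            continue
        prob = 1.0
        for j in range(3):
            if j != i:
                prob *= sum(1 for a in history if a[j] == prof[j]) / t
        total += prob * U[prof]
    return total


def jsfp_value(i, x, history):
    """Eq. (3.4) in averaged form: replay the history with player i fixed to x."""
    return sum(U[a[:i] + (x,) + a[i + 1:]] for a in history) / len(history)


def best(value, i, history):
    return max(ACTIONS, key=lambda x: value(i, x, history))


fp_next = tuple(best(fp_value, i, history) for i in range(3))
jsfp_next = tuple(best(jsfp_value, i, history) for i in range(3))
```

Under FP the independence presumption assigns positive probability to joint actions such as $(b, a, b)$ that were never played, which pulls the players away from the equilibrium; JSFP only averages over joint actions that actually occurred and stays at $(b, b, b)$.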

The convergence properties of JSFP in the case of more than two players, even for potential games, are unresolved.³ We will establish convergence of JSFP in the case where players use some sort of inertia, i.e., players are reluctant to switch to a better action.

² To see this, consider the following 3 player identical interest game. For all $\mathcal{P}_i \in \mathcal{P}$, let $A_i = \{a, b\}$. Let the utility be defined as follows: $U(a, b, a) = U(b, a, a) = 1$, $U(a, a, a) = U(b, b, a) = 0$, $U(a, a, b) = U(b, b, b) = 1$, $U(a, b, b) = -1$, $U(b, a, b) = -100$. Suppose the first action played is $a(1) = \{a, a, a\}$. In the FP process each player will seek to deviate in the ensuing stage, $a(2) = \{b, b, b\}$. The joint action $\{b, b, b\}$ is a strict Nash equilibrium. One can easily verify that the ensuing action in a FP process will be $a(3) = \{a, b, a\}$. Therefore, a strict Nash equilibrium is not absorbing in the FP process with more than 2 players.

³ For two player games, JSFP and standard FP are equivalent, hence the convergence results for FP hold for JSFP.

The JSFP with inertia process is defined as follows. Players choose their actions according to the following rules:

JSFP-1: If the action $a_i(t - 1)$ chosen by player $\mathcal{P}_i$ at time $t - 1$ belongs to $BR_i(z_{-i}(t))$, then $a_i(t) = a_i(t - 1)$.

JSFP-2: Otherwise, player $\mathcal{P}_i$ chooses an action, $a_i(t)$, at time $t$ according to the probability distribution

$$\alpha_i(t) \beta_i(t) + (1 - \alpha_i(t))\, v^{a_i(t-1)},$$

where $\alpha_i(t)$ is a parameter representing player $\mathcal{P}_i$’s willingness to optimize at time $t$, $\beta_i(t) \in \Delta(A_i)$ is any probability distribution whose support is contained in the set $BR_i(z_{-i}(t))$, and $v^{a_i(t-1)}$ is the probability distribution with full support on the action $a_i(t - 1)$, i.e.,

$$v^{a_i(t-1)} = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix},$$

where the “1” occurs in the coordinate of $\Delta(A_i)$ associated with $a_i(t - 1)$.


According to these rules, player $\mathcal{P}_i$ will stay with the previous action $a_i(t - 1)$ with probability $1 - \alpha_i(t)$ even when there is a perceived opportunity for utility improvement. We make the following standing assumption on the players’ willingness to optimize.

Assumption 3.2.1. There exist constants $\underline{\epsilon}$ and $\bar{\epsilon}$ such that for all time $t \ge 0$ and for all players $\mathcal{P}_i \in \mathcal{P}$,

$$0 < \underline{\epsilon} < \alpha_i(t) < \bar{\epsilon} < 1.$$

This assumption implies that players are always willing to optimize with some nonzero inertia⁴.
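Rules JSFP-1 and JSFP-2 amount to a short decision procedure. The following is a minimal sketch with some assumptions made explicit: $\beta_i(t)$ is taken to be uniform over the best response set (the rules allow any distribution supported on that set), the predicted utilities are passed in as a dictionary `V`, and the function name is hypothetical.

```python
import random


def jsfp_inertia_step(prev_action, V, alpha, rng):
    """Choose a_i(t) from predicted utilities V[action] under JSFP-1 and JSFP-2."""
    best_value = max(V.values())
    best_set = [x for x, v in V.items() if v == best_value]
    if prev_action in best_set:
        return prev_action              # JSFP-1: already a best response, stay
    if rng.random() < alpha:
        return rng.choice(best_set)     # JSFP-2: optimize; beta uniform (assumption)
    return prev_action                  # JSFP-2: inertia, with probability 1 - alpha


# Usage: a player whose previous action "x" is no longer a best response.
rng = random.Random(0)
V = {"x": 1.0, "y": 2.0}
alpha = 0.3  # must lie strictly between 0 and 1 under Assumption 3.2.1
samples = [jsfp_inertia_step("x", V, alpha, rng) for _ in range(10000)]
switch_fraction = samples.count("y") / len(samples)
```

Over many independent draws the fraction of switches should be close to `alpha`, while a player already at a best response never moves.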

The following result shows a similar absorption property of pure Nash equilibria in a JSFP with inertia process.

Proposition 3.2.2. In any finite n-person game, if at any time $t > 0$ the joint action $a(t)$ generated by a JSFP with inertia process is 1) a pure Nash equilibrium and 2) the action $a_i(t) \in BR_i(z_{-i}(t))$ for all players $\mathcal{P}_i \in \mathcal{P}$, then $a(t + \tau) = a(t)$ for all $\tau > 0$.

Proof. For each player $\mathcal{P}_i \in \mathcal{P}$ and for all actions $a_i \in A_i$,

$$U_i(a_i(t), z_{-i}(t)) \ge U_i(a_i, z_{-i}(t)).$$

Since $a(t)$ is a pure Nash equilibrium, we know that for all actions $a_i \in A_i$,

$$U_i(a_i(t), a_{-i}(t)) \ge U_i(a_i, a_{-i}(t)).$$

By writing $z_{-i}(t + 1)$ in terms of $z_{-i}(t)$ and $a_{-i}(t)$,

$$U_i(a_i(t), z_{-i}(t + 1)) = \frac{t}{t + 1} U_i(a_i(t), z_{-i}(t)) + \frac{1}{t + 1} U_i(a_i(t), a_{-i}(t)).$$

Therefore, $a_i(t)$ is also a best response to $z_{-i}(t + 1)$,

$$U_i(a_i(t), z_{-i}(t + 1)) \ge U_i(a_i, z_{-i}(t + 1)), \quad \forall a_i \in A_i.$$

Since $a_i(t) \in BR_i(z_{-i}(t + 1))$ for all players, rule JSFP-1 gives $a(t + 1) = a(t)$. □

3.2.3  Convergence  to  Nash  Equilibrium 

The following establishes the main result regarding the convergence of JSFP with inertia.

We will assume that no player is indifferent between distinct strategies⁵.

⁴ This assumption can be relaxed to holding for sufficiently large $t$, as opposed to all $t$.

⁵ One could alternatively assume that all pure equilibria are strict.

Assumption 3.2.2. Player utilities satisfy

$$U_i(a_i^1, a_{-i}) \neq U_i(a_i^2, a_{-i}), \quad \forall a_i^1, a_i^2 \in A_i,\ a_i^1 \neq a_i^2,\ \forall a_{-i} \in A_{-i},\ \forall i \in \{1, \ldots, n\}. \tag{3.5}$$

Theorem 3.2.1. In any finite generalized ordinal potential game in which no player is indifferent between distinct strategies as in Assumption 3.2.2, the action profiles $a(t)$ generated by JSFP with inertia under Assumption 3.2.1 converge to a pure Nash equilibrium almost surely.

We  provide  a  complete  proof  of  Theorem  3.2.1  in  the  Appendix  of  this  chapter.  We 
encourage  the  reader  to  first  review  the  proof  of  fading  memory  JSFP  with  inertia  in 
Theorem  3.3.1  of  the  following  section. 

3.3  Fading  Memory  JSFP  with  Inertia 

We  now  analyze  the  case  where  players  view  recent  information  as  more  important. 
In  fading  memory  JSFP  with  inertia,  players  replace  true  empirical  frequencies  with 
weighted  empirical  frequencies  defined  by  the  recursion 

zffff)  :=  /(a-i(O)  =  d-i}, 

Z-T(t)  :=  (1  —  —  1)  +  pl{a_ft  —  1)  =  d_j},  Vf  >  1, 

where  0  <  p  <  1  is  a  parameter  with  (1— p)  being  the  discount  factor.  Let  Z-ft)  denote 
the  weighted  empirical  frequency  vector  formed  by  the  components  {zfyf  (f)}a_,^A-r 
Note  that  the  dimension  of  Z-ft)  is  the  cardinality  \A-f\. 

One can identify the limiting cases of the discount factor. When $\rho = 1$ we have
“Cournot” beliefs, where only the most recent information matters. In the case when
$\rho$ is not a constant, but rather $\rho(t) = 1/(t+1)$, all past information is given equal
importance as analyzed in Section 3.2.




Utility prediction and action selection with fading memory are done in the same
way as in Section 3.2, and in particular, in accordance with rules JSFP-1 and JSFP-2.
To make a decision, player $\mathcal{P}_i$ needs only the weighted average utility that would have
been received for each action, which is defined for action $a_i \in \mathcal{A}_i$ as

\[ \tilde V_i^{a_i}(t) := U_i(a_i, \tilde z_{-i}(t)) = \sum_{\bar a_{-i} \in \mathcal{A}_{-i}} U_i(a_i, \bar a_{-i})\, \tilde z_{-i}^{\bar a_{-i}}(t). \]

One can easily verify that the weighted average utility $\tilde V_i^{a_i}(t)$ for action $a_i \in \mathcal{A}_i$ admits
the recursion

\[ \tilde V_i^{a_i}(t) = \rho\, U_i(a_i, a_{-i}(t-1)) + (1-\rho)\, \tilde V_i^{a_i}(t-1). \]

Once again, player $\mathcal{P}_i$ is not required to track the weighted empirical frequency vector
$\tilde z_{-i}(t)$ or to compute expectations over $\mathcal{A}_{-i}$.
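The equivalence between the recursion and the explicit expectation can be checked numerically. The following is a minimal Python sketch, assuming a hypothetical two-player matrix game: the utility table `U`, the observed history `opponent_actions`, and the value of `rho` are illustrative choices, not taken from the text. It updates the weighted average utilities from the most recent opponent action only, and compares the result with the sum over weighted empirical frequencies.

```python
def update_weighted_utilities(V, U, opp_action, rho):
    """One step of V(t) = rho * U(a_i, a_{-i}(t-1)) + (1 - rho) * V(t-1)."""
    return [rho * U[a][opp_action] + (1 - rho) * V[a] for a in range(len(U))]


def update_weighted_frequencies(z, opp_action, rho):
    """One step of the weighted empirical frequency recursion."""
    return [(1 - rho) * z_b + rho * (1.0 if b == opp_action else 0.0)
            for b, z_b in enumerate(z)]


# Hypothetical utility table U[a_i][a_-i] for one player (illustrative values).
U = [[3.0, 0.0, 1.0],
     [2.0, 2.0, 0.0]]
opponent_actions = [0, 2, 2, 1, 0]  # assumed history a_{-i}(0), a_{-i}(1), ...
rho = 0.3

# Initial conditions at t = 0: the frequencies are the indicator of a_{-i}(0).
z = [1.0 if b == opponent_actions[0] else 0.0 for b in range(3)]
V = [U[a][opponent_actions[0]] for a in range(2)]

for opp in opponent_actions:  # stage t = 1, 2, ... uses a_{-i}(t-1)
    z = update_weighted_frequencies(z, opp, rho)
    V = update_weighted_utilities(V, U, opp, rho)

# Explicit expectation over the weighted empirical frequencies.
V_explicit = [sum(U[a][b] * z[b] for b in range(3)) for a in range(2)]
```

The player-side update touches one number per own action, whereas the explicit form needs the full frequency vector over the opponents' joint action set, which is the point of the recursion.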

As  before,  pure  Nash  equilibria  have  an  absorption  property  under  fading  memory 
JSFP  with  inertia. 

Proposition 3.3.1. In any finite n-person game, if at any time $t > 0$ the joint action
$a(t)$ generated by a fading memory JSFP with inertia process is 1) a pure Nash equilibrium
and 2) the action $a_i(t) \in BR_i(\tilde z_{-i}(t))$ for all players $\mathcal{P}_i \in \mathcal{P}$, then $a(t+\tau) = a(t)$
for all $\tau > 0$.

Proof. For each player $\mathcal{P}_i \in \mathcal{P}$ and for all actions $a_i \in \mathcal{A}_i$,

\[ U_i(a_i(t), \tilde z_{-i}(t)) \ge U_i(a_i, \tilde z_{-i}(t)). \]

Since $a(t)$ is a pure Nash equilibrium, we know that for all actions $a_i \in \mathcal{A}_i$

\[ U_i(a_i(t), a_{-i}(t)) \ge U_i(a_i, a_{-i}(t)). \]

By writing $\tilde z_{-i}(t+1)$ in terms of $\tilde z_{-i}(t)$ and $a_{-i}(t)$,

\[ U_i(a_i(t), \tilde z_{-i}(t+1)) = (1-\rho)\, U_i(a_i(t), \tilde z_{-i}(t)) + \rho\, U_i(a_i(t), a_{-i}(t)). \]

Therefore, $a_i(t)$ is also a best response to $\tilde z_{-i}(t+1)$,

\[ U_i(a_i(t), \tilde z_{-i}(t+1)) \ge U_i(a_i, \tilde z_{-i}(t+1)), \quad \forall a_i \in \mathcal{A}_i. \]

Since $a_i(t) \in BR_i(\tilde z_{-i}(t+1))$ for all players, $a(t+1) = a(t)$. □

The  following  theorem  establishes  convergence  to  Nash  equilibrium  for  fading 
memory  JSFP  with  inertia. 

Theorem 3.3.1. In any finite generalized ordinal potential game in which no player is
indifferent between distinct strategies as in Assumption 3.2.2, the action profiles $a(t)$
generated by a fading memory JSFP with inertia process satisfying Assumption 3.2.1
converge to a pure Nash equilibrium almost surely.

Proof. The proof follows a similar structure to the proof of Theorem 6.2 in [You05].
At time $t$, let $a^0 := a(t)$. There exists a positive constant $T$, independent of $t$, such
that if the current action $a^0$ is repeated $T$ consecutive stages, i.e. $a(t) = \ldots = a(t+T-1) = a^0$,
then $BR_i(\tilde z_{-i}(t+T)) = BR_i(a^0_{-i})$⁶ for all players. The probability
of such an event is at least $(1-\bar\epsilon)^{n(T-1)}$, where $n$ is the number of players. If the
joint action $a^0$ is an equilibrium, then by Proposition 3.3.1 we are done. Otherwise,
there must be at least one player $\mathcal{P}_{i(1)} \in \mathcal{P}$ such that $a^0_{i(1)} \notin BR_{i(1)}(a^0_{-i(1)})$ and hence
$a^0_{i(1)} \notin BR_{i(1)}(\tilde z_{-i(1)}(t+T))$.

⁶ Since no player is indifferent between distinct strategies, the best response to the current action
profile, $BR_i(a^0_{-i})$, is a singleton.

Consider now the event that, at time $t+T$, exactly one player switches to a different
action, i.e., $a^1 := a(t+T) = (a^*_{i(1)}, a^0_{-i(1)})$ for some player $\mathcal{P}_{i(1)} \in \mathcal{P}$ where
$U_{i(1)}(a^1) > U_{i(1)}(a^0)$. This event happens with probability at least $\epsilon(1-\bar\epsilon)^{n-1}$. Note
that if $\phi(\cdot)$ is a generalized ordinal potential function for the game, then $\phi(a^0) < \phi(a^1)$.

Continuing along the same lines, if the current action $a^1$ is repeated $T$ consecutive
stages, i.e. $a(t+T) = \ldots = a(t+2T-1) = a^1$, then $BR_i(\tilde z_{-i}(t+2T)) = BR_i(a^1_{-i})$
for all players. The probability of such an event is at least $(1-\bar\epsilon)^{n(T-1)}$. If the joint
action $a^1$ is an equilibrium, then by Proposition 3.3.1, we are done. Otherwise, there
must be at least one player $\mathcal{P}_{i(2)} \in \mathcal{P}$ such that $a^1_{i(2)} \notin BR_{i(2)}(a^1_{-i(2)})$ and hence
$a^1_{i(2)} \notin BR_{i(2)}(\tilde z_{-i(2)}(t+2T))$.

One can repeat the arguments above to construct a sequence of profiles
$a^0, a^1, a^2, \ldots, a^m$, where $a^k = (a^*_{i(k)}, a^{k-1}_{-i(k)})$ for all $k \ge 1$, with the property that

\[ \phi(a^0) < \phi(a^1) < \ldots < \phi(a^m), \]

and $a^m$ is an equilibrium. This means that given $\{\tilde z_{-i}(t)\}_{i=1}^n$, there exist constants

\[ \tilde T = (|\mathcal{A}| + 1)\,T > 0, \]
\[ \tilde\epsilon = \left(\epsilon(1-\bar\epsilon)^{n-1}\right)^{|\mathcal{A}|} \left((1-\bar\epsilon)^{n(T-1)}\right)^{|\mathcal{A}|+1} > 0, \]

both of which are independent of $t$, such that the following event happens with probability
at least $\tilde\epsilon$: $a(t+\tilde T)$ is an equilibrium and $a_i(t+\tilde T) \in BR_i(\tilde z_{-i}(t+\tilde T))$ for
all players $\mathcal{P}_i \in \mathcal{P}$. This implies that $a(t)$ converges to a pure equilibrium almost
surely. □
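The proof only asserts that a horizon $T$ exists. As an illustration, one sufficient choice can be computed from the quantities $M_u$ and $m_u$ defined in Section 3.6.1 (the largest and the smallest nonzero utility difference). If a profile is repeated for $T$ stages, the weighted empirical frequencies place weight $1-(1-\rho)^T$ on that profile and $(1-\rho)^T$ on everything earlier, so by linearity of the utilities the best response to the repeated profile stays strictly preferred whenever $(1-\rho)^T (M_u + m_u) < m_u$. The sketch below is an assumption-laden illustration of that bound, not a formula from the text; the function name is hypothetical.

```python
def memory_dominance_horizon(rho, M_u, m_u):
    """Smallest T with (1 - rho)**T * (M_u + m_u) < m_u.

    This is a sufficient (not necessarily tight) number of repetitions after
    which the repeated profile dominates the fading memory.
    """
    if not (0 < rho <= 1) or not (0 < m_u <= M_u):
        raise ValueError("need 0 < rho <= 1 and 0 < m_u <= M_u")
    T = 1
    while (1 - rho) ** T * (M_u + m_u) >= m_u:
        T += 1
    return T


# Illustrative values: rho as in the simulation of Section 3.4.1, utilities assumed.
T = memory_dominance_horizon(rho=0.03, M_u=10.0, m_u=1.0)
```

Because $T$ grows like $\log((M_u + m_u)/m_u) / (-\log(1-\rho))$, a small $\rho$ (long memory) requires many repetitions, which is consistent with the memory dominance argument failing in the non-discounted case.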

3.4  Congestion  Games  and  Distributed  Traffic  Routing 

In this section, we illustrate the main results on congestion games, as defined in
Section 2.3.3, which are a special case of the generalized ordinal potential games
addressed in Theorems 3.2.1 and 3.3.1. We illustrate these results on a simulation of
distributed traffic routing. We go on to discuss how to modify player utilities in
distributed traffic routing to allow a centralized planner to achieve a desired collective
objective through distributed learning.

3.4.1 Distributed Traffic Routing

We consider a congestion game, as defined in Section 2.3.3, with 100 players, or
drivers, seeking to traverse from node A to node B along 10 different parallel roads
as illustrated in Figure 3.1. Each driver can select any road as a possible route. In
terms of congestion games, the set of resources is the set of roads, $\mathcal{R}$, and each player
can select one road, i.e., $\mathcal{A}_i = \mathcal{R}$.

Figure 3.1: Fading Memory JSFP with Inertia: Congestion Game Example - Network Topology

Each road has a quadratic cost function with positive (randomly chosen) coefficients,

\[ c_{r_i}(k) = a_i k^2 + b_i k + c_i, \quad i = 1, \ldots, 10, \]

where $k$ represents the number of vehicles on that particular road. The actual coefficients
are unimportant as we are just using this example as an opportunity to illustrate
the convergence properties of fading memory JSFP with inertia. This
cost function may represent the delay incurred by a driver as a function of the number
of other drivers sharing the same road.

We simulated a case where drivers choose their initial routes randomly and, every
day thereafter, adjust their routes using fading memory JSFP with inertia. The
parameters $\alpha_i(t)$ are chosen as 0.5 for all days and all players, and the fading memory
parameter $\rho$ is chosen as 0.03. The number of vehicles on each road fluctuates initially
and then stabilizes as illustrated in Figure 3.2. Figure 3.3 illustrates the evolution of the
congestion cost on each road. One can observe that the congestion cost on each road
converges approximately to the same value, which is consistent with a Nash equilibrium
with a large number of drivers. This behavior resembles an approximate “Wardrop
equilibrium” [War52], which represents a steady-state situation in which the congestion
cost on each road is equal due to the fact that, as the number of drivers increases,
the effect of an individual driver on the traffic conditions becomes negligible.
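The experiment can be reproduced in outline with the following Python sketch. It is an illustration under stated assumptions rather than the code behind the figures: the road coefficients are arbitrary positive numbers, `alpha` is read as the probability that a driver who is not already best responding switches to a best response (with the value 0.5 used here, the complementary reading gives the same process), ties among best responses are broken by lowest road index, and all function names are hypothetical.

```python
import random


def cost(coeff, k):
    """Quadratic congestion cost a*k^2 + b*k + c for k vehicles on one road."""
    a, b, c = coeff
    return a * k * k + b * k + c


def count_vehicles(actions, n_roads):
    counts = [0] * n_roads
    for road in actions:
        counts[road] += 1
    return counts


def hypothetical_utility(coeffs, counts, own_road, road):
    """Negative cost a driver on own_road would have paid on `road`,
    holding every other driver's choice fixed."""
    load = counts[road] + (0 if road == own_road else 1)
    return -cost(coeffs[road], load)


def is_pure_nash(coeffs, actions):
    counts = count_vehicles(actions, len(coeffs))
    return all(
        hypothetical_utility(coeffs, counts, own, own)
        >= hypothetical_utility(coeffs, counts, own, road)
        for own in actions for road in range(len(coeffs)))


def simulate(coeffs, n_players=100, rho=0.03, alpha=0.5, days=200, seed=0):
    rng = random.Random(seed)
    n_roads = len(coeffs)
    actions = [rng.randrange(n_roads) for _ in range(n_players)]
    counts = count_vehicles(actions, n_roads)
    # Weighted average utilities start from the initial joint action.
    V = [[hypothetical_utility(coeffs, counts, actions[i], r)
          for r in range(n_roads)] for i in range(n_players)]
    history = [counts]
    for _ in range(days):
        next_actions = list(actions)
        for i in range(n_players):
            for r in range(n_roads):
                u = hypothetical_utility(coeffs, counts, actions[i], r)
                V[i][r] = rho * u + (1 - rho) * V[i][r]
            best = max(range(n_roads), key=V[i].__getitem__)
            # JSFP-1: stay if the current road is already a best response.
            # JSFP-2: otherwise switch with probability alpha, else stay (inertia).
            if V[i][actions[i]] < V[i][best] and rng.random() < alpha:
                next_actions[i] = best
        actions = next_actions
        counts = count_vehicles(actions, n_roads)
        history.append(counts)
    return actions, history


# Example run: 100 drivers, 10 roads, arbitrary positive coefficients.
coeff_rng = random.Random(1)
coeffs = [tuple(coeff_rng.uniform(0.1, 1.0) for _ in range(3)) for _ in range(10)]
final_actions, history = simulate(coeffs)
print(history[-1], is_pure_nash(coeffs, final_actions))
```

Each driver stores only one number per road and never touches the joint action space of the other 99 drivers; the per-road vehicle counts are the only shared information.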


Figure 3.2: Fading Memory JSFP with Inertia: Evolution of Number of Vehicles on Each Route

Figure 3.3: Fading Memory JSFP with Inertia: Evolution of Congestion Cost on Each Route

Note that FP could not be implemented even on this very simple congestion game.
A driver using FP would need to track the empirical frequencies of the choices of the
99 other drivers and compute an expected utility evaluated over a probability space of
dimension $10^{99}$.

It turns out that JSFP, fading memory JSFP, and other virtual payoff based learning
algorithms are strongly connected to actual driver behavioral models. Consider the
driver adjustment process studied in [BPK91], which is illustrated in Figure 3.4.
The adjustment process highlighted is precisely JSFP with inertia.

Figure 3.4: Example of a Driver Adjustment Process

3.4.2 Incorporating Tolls to Minimize the Total Congestion

It is well known that a Nash equilibrium may not minimize the total congestion
experienced by all drivers [Rou03]. In this section, we show how a global planner can
minimize the total congestion by implementing tolls on the network. The results are
applicable to general congestion games, but we present the approach in the language
of distributed traffic routing.

The total congestion experienced by all drivers on the network is

\[ T_c(a) := \sum_{r \in \mathcal{R}} \sigma_r(a)\, c_r(\sigma_r(a)). \]

Define a new congestion game where each driver’s utility takes the form

\[ U_i(a) = -\sum_{r \in a_i} \left( c_r(\sigma_r(a)) + t_r(\sigma_r(a)) \right), \]

where $t_r(\cdot)$ is the toll imposed on road $r$, which is a function of the number of users of
road $r$.

The following proposition, which is a special case of Proposition 5.3.1, outlines
how to incorporate tolls so that the minimum congestion solution is a Nash equilibrium.
The approach is similar to the taxation approaches for nonatomic congestion
games proposed in [Mil04, San02].

Proposition 3.4.1. Consider a congestion game of any network topology. If the
imposed tolls are set as

\[ t_r(k) = (k-1)\left[ c_r(k) - c_r(k-1) \right], \quad \forall k \ge 1, \]

then the total negative congestion experienced by all drivers, $\phi_c(a) := -T_c(a)$, is a
potential function for the congestion game with tolls.
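The potential property can be checked numerically on a small parallel-road example. The Python sketch below assumes each driver uses exactly one road and uses hypothetical integer cost functions so that the comparison is exact; the function names are illustrative. For every unilateral road switch it compares the change in the mover's tolled utility with the change in the negative total congestion.

```python
def toll(c, k):
    """Toll t_r(k) = (k - 1) * (c_r(k) - c_r(k - 1)) for k >= 1 users."""
    return (k - 1) * (c(k) - c(k - 1))


def total_congestion(costs, counts):
    """T_c: sum over roads of (number of users) * (cost at that load)."""
    return sum(k * c(k) for c, k in zip(costs, counts))


def tolled_utility(costs, counts, road):
    """Utility of a driver on `road`: negative of congestion cost plus toll."""
    k = counts[road]
    return -(costs[road](k) + toll(costs[road], k))


def check_potential(costs, counts):
    """True if every unilateral switch changes the mover's tolled utility by
    exactly the change in -T_c."""
    for x in range(len(counts)):
        if counts[x] == 0:
            continue  # nobody on road x can deviate
        for y in range(len(counts)):
            if y == x:
                continue
            after = list(counts)
            after[x] -= 1
            after[y] += 1
            du = tolled_utility(costs, after, y) - tolled_utility(costs, counts, x)
            dphi = total_congestion(costs, counts) - total_congestion(costs, after)
            if du != dphi:
                return False
    return True


# Hypothetical integer-valued cost functions on three parallel roads.
costs = [lambda k: k * k + 2 * k + 1,
         lambda k: 3 * k * k + k + 2,
         lambda k: 2 * k + 5]
print(check_potential(costs, [3, 1, 0]))
```

The check works because congestion plus toll at load $k$ equals $k\,c_r(k) - (k-1)\,c_r(k-1)$, which is exactly the marginal contribution of one more driver to $T_c$ on that road.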

By implementing the tolling scheme set forth in Proposition 3.4.1, we guarantee
that all action profiles that minimize the total congestion experienced on the network
are equilibria of the congestion game with tolls. However, there may be additional
equilibria at which an inefficient operating condition can still occur. The following
proposition establishes the uniqueness of a strict Nash equilibrium for congestion games of
parallel network topologies such as the one considered in this example.

Proposition 3.4.2. Consider a congestion game with nondecreasing congestion functions
where each driver is allowed to select any one road, i.e. $\mathcal{A}_i = \mathcal{R}$ for all drivers.
If the congestion game has at least one strict equilibrium, then all equilibria have the
same aggregate vehicle distribution over the network. Furthermore, all equilibria are
strict.

Proof. Suppose action profiles $a^1$ and $a^2$ are equilibria with $a^1$ being a strict
equilibrium. We will use the shorthand notation $\sigma_r^{a^1}$ to represent $\sigma_r(a^1)$. Let
$\sigma(a^1) := (\sigma_{r_1}^{a^1}, \ldots, \sigma_{r_n}^{a^1})$ and $\sigma(a^2) := (\sigma_{r_1}^{a^2}, \ldots, \sigma_{r_n}^{a^2})$ be the aggregate vehicle distribution over the
network for equilibrium $a^1$ and $a^2$. If $\sigma(a^1) \neq \sigma(a^2)$, there exists a road $a$ such that
$\sigma_a^{a^1} > \sigma_a^{a^2}$ and a road $b$ such that $\sigma_b^{a^1} < \sigma_b^{a^2}$. Therefore, we know that

\[ c_a(\sigma_a^{a^1}) \ge c_a(\sigma_a^{a^2} + 1), \]
\[ c_b(\sigma_b^{a^2}) \ge c_b(\sigma_b^{a^1} + 1). \]

Since $a^1$ and $a^2$ are equilibria with $a^1$ being strict,

\[ c_a(\sigma_a^{a^1}) < c_{r_i}(\sigma_{r_i}^{a^1} + 1), \quad \forall r_i \in \mathcal{R}, \]
\[ c_b(\sigma_b^{a^2}) \le c_{r_i}(\sigma_{r_i}^{a^2} + 1), \quad \forall r_i \in \mathcal{R}. \]

Using the above inequalities, we can show that

\[ c_a(\sigma_a^{a^1}) \ge c_a(\sigma_a^{a^2} + 1) \ge c_b(\sigma_b^{a^2}) \ge c_b(\sigma_b^{a^1} + 1) > c_a(\sigma_a^{a^1}), \]

which gives us a contradiction. Therefore $\sigma(a^1) = \sigma(a^2)$. Since $a^1$ is a strict equilibrium,
$a^2$ must be a strict equilibrium as well. □

When the tolling scheme set forth in Proposition 3.4.1 is applied to the congestion
game example considered previously, the resulting congestion game with tolls is a
potential game in which no player is indifferent between distinct strategies. Proposition
3.4.1 guarantees that the action profiles that minimize the total congestion experienced
by all drivers on the network are in fact strict equilibria of the congestion game
with tolls. Furthermore, if the new congestion functions are nondecreasing⁷, then by
Proposition 3.4.2, all strict equilibria must have the same aggregate vehicle distribution
over the network, and therefore must minimize the total congestion experienced
by all drivers on the network. Therefore, the action profiles generated by fading memory
JSFP with inertia converge to an equilibrium that minimizes the total congestion
experienced by all users, as shown in Figure 3.5.

⁷ Simple conditions on the original congestion functions can be established to guarantee that the new
congestion functions, i.e. congestion plus tolls, are nondecreasing.

Figure 3.5: Fading Memory JSFP with Inertia: Evolution of Total Congestion Experienced by
All Drivers with and without Tolls.

3.5 Concluding Remarks and Future Work

We have analyzed the long-term behavior of a large number of players in large-scale
games where players are limited in both their observational and computational capabilities.
In particular, we analyzed a version of JSFP and showed that it accommodates
inherent player limitations in information gathering and processing. Furthermore, we
showed that JSFP has guaranteed convergence to a pure Nash equilibrium in all generalized
ordinal potential games, which includes but is not limited to all congestion
games, when players use some inertia either with or without exponential discounting
of the historical data. The methods were illustrated on a transportation congestion
game, in which a large number of vehicles make daily routing decisions to optimize
their own objectives in response to the aggregate congestion on each road of interest.
An interesting continuation of this research would be the case where players observe
only the actual utilities they receive. This situation will be the focus of Chapter 5.

The method of proof of Theorems 3.2.1 and 3.3.1 relies on inertia to derive a
positive probability of a single player seeking to make a utility improvement, thereby
increasing the potential function. This suggests a convergence rate that is exponential
in the game size, i.e., number of players and actions. It should be noted that inertia
is simply a proof device that assures convergence for generic potential games. The
proof provides just one out of multiple paths to convergence. The simulations reflect
that convergence can be much faster. Indeed, simulations suggest that convergence
is possible even in the absence of inertia but not necessarily for all potential games.
Furthermore, recent work [HM06] suggests that convergence rates of a broad class
of distributed learning processes can be exponential in the game size as well, and so
this seems to be a limitation in the framework of distributed learning rather than any
specific learning process (as opposed to centralized algorithms for computing an
equilibrium).


3.6  Appendix  to  Chapter  3 


3.6.1  Proof  of  Theorem  3.2.1 

This  section  is  devoted  to  the  proof  of  Theorem  3.2.1.  It  will  be  helpful  to  note  the 
following  simple  observations: 

1. The expression for $U_i(a_i, z_{-i}(t))$ in equation (3.4) is linear in $z_{-i}(t)$.

2. If an action profile, $a^0 \in \mathcal{A}$, is repeated over the interval $[t, t+N-1]$, i.e.,

\[ a(t) = a(t+1) = \ldots = a(t+N-1) = a^0, \]

then $z(t+N)$ can be written as

\[ z(t+N) = \frac{t}{t+N}\, z(t) + \frac{N}{t+N}\, v^{a^0}, \]

and likewise $z_{-i}(t+N)$ can be written as

\[ z_{-i}(t+N) = \frac{t}{t+N}\, z_{-i}(t) + \frac{N}{t+N}\, v^{a^0_{-i}}. \]

We begin by defining the quantities $\delta_i(t)$, $M_u$, $m_u$, and $\gamma$ as follows. Assume that
player $\mathcal{P}_i$ played a best response at least one time in the period $[0, t]$, where $t \in [0, \infty)$.
Define

\[ \delta_i(t) := \min\{0 \le \tau \le t : a_i(t-\tau) \in BR_i(z_{-i}(t-\tau))\}. \]

In other words, $t - \delta_i(t)$ is the last time in the period $[0, t]$ at which player $\mathcal{P}_i$ played
a best response. If player $\mathcal{P}_i$ never played a best response in the period $[0, t]$, then we
adopt the convention $\delta_i(t) = \infty$. Note that

\[ a_i(t-\tau) = a_i(t), \quad \forall \tau \in \{0, 1, \ldots, \min\{\delta_i(t), t\}\}. \]

Now define

\[ M_u := \max\{|U_i(a^1) - U_i(a^2)| : a^1, a^2 \in \mathcal{A},\ \mathcal{P}_i \in \mathcal{P}\}, \]
\[ m_u := \min\{|U_i(a^1) - U_i(a^2)| : |U_i(a^1) - U_i(a^2)| > 0,\ a^1, a^2 \in \mathcal{A},\ \mathcal{P}_i \in \mathcal{P}\}, \]
\[ \gamma := \lceil M_u / m_u \rceil, \]

where $\lceil \cdot \rceil$ denotes integer ceiling.
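For a small game, the largest utility difference, the smallest nonzero utility difference, and the ceiling of their ratio can be computed by brute force. The Python sketch below assumes a hypothetical representation in which each player's utility function is a dictionary from joint action profiles to numbers; the representation and the function name are illustrative, not from the text.

```python
import math
from itertools import product


def utility_spread(utilities):
    """Return (M_u, m_u, gamma) for a finite game.

    utilities: one dict per player, mapping each joint action profile to that
    player's utility. Raises ValueError if every player is indifferent among
    all profiles, since m_u is then undefined.
    """
    diffs = [abs(u[a1] - u[a2])
             for u in utilities
             for a1, a2 in product(u, repeat=2)]
    M_u = max(diffs)
    m_u = min(d for d in diffs if d > 0)
    return M_u, m_u, math.ceil(M_u / m_u)


# Hypothetical 2x2 game: keys are joint profiles (a_1, a_2).
u1 = {(0, 0): 2, (0, 1): 0, (1, 0): 3, (1, 1): 1}
u2 = {(0, 0): 2, (0, 1): 5, (1, 0): 0, (1, 1): 1}
M_u, m_u, gamma = utility_spread([u1, u2])
```

Here $\gamma$ measures how many repetitions of the current profile are needed, per stage of suboptimal play, before the repeated profile outweighs that stretch of history in the claims that follow.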

The proof of fading memory JSFP with inertia relied on a notion of memory dominance.
This means that if the current action profile is repeated a sufficient number of
times (finite and independent of time) then a best response to the weighted empirical
frequencies is equivalent to a best response to the current action profile and hence will
increase the potential provided that there is only a unique deviator. This will always
happen with at least a fixed (time independent) probability because of the players’
inertia.

In the non-discounted case the memory dominance approach will not work for the
reason that the probability of dominating the memory because of the players’ inertia
diminishes with time. However, the following claims show that one does not need to
dominate the entire memory, but rather just the portion of time for which the player
was playing a suboptimal action. By dominating this portion of the memory, one can
guarantee that a unilateral best response to the empirical frequencies will increase the
potential. This is the fundamental idea in the proof of Theorem 3.2.1.

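Claim 3.6.1. Consider a player $\mathcal{P}_i$ with $\delta_i(t) < \infty$. Let $t_1$ be any finite integer satisfying

\[ t_1 \ge \gamma\, \delta_i(t). \]

If an action profile, $a^0 \in \mathcal{A}$, is repeated over the interval $[t, t+t_1]$, i.e.,

\[ a(t) = a(t+1) = \ldots = a(t+t_1) = a^0, \]

then

\[ \hat a_i \in BR_i(z_{-i}(t+t_1+1)) \;\Rightarrow\; U_i(\hat a_i, a^0_{-i}) \ge U_i(a^0_i, a^0_{-i}), \]

i.e., player $\mathcal{P}_i$’s best response at time $t+t_1+1$ cannot be a worse response to $a^0_{-i}$ than
$a^0_i$.

Proof. Since $\hat a_i \in BR_i(z_{-i}(t+t_1+1))$,

\[ U_i(\hat a_i, z_{-i}(t+t_1+1)) - U_i(a^0_i, z_{-i}(t+t_1+1)) \ge 0. \]

Expressing $z_{-i}(t+t_1+1)$ as a summation over the intervals $[0, t-\delta_i(t)-1]$,
$[t-\delta_i(t), t-1]$, and $[t, t+t_1]$ and using the definition (3.4) leads to

\[ (t_1+1)\left[ U_i(\hat a_i, a^0_{-i}) - U_i(a^0_i, a^0_{-i}) \right]
 + \sum_{\tau = t-\delta_i(t)}^{t-1} \left[ U_i(\hat a_i, a_{-i}(\tau)) - U_i(a^0_i, a_{-i}(\tau)) \right]
 + (t-\delta_i(t)) \left[ U_i(\hat a_i, z_{-i}(t-\delta_i(t))) - U_i(a^0_i, z_{-i}(t-\delta_i(t))) \right] \ge 0. \]

Now,

\[ a_i(t-\delta_i(t)) = a_i(t-\delta_i(t)+1) = \ldots = a_i(t) = a^0_i \in BR_i(z_{-i}(t-\delta_i(t))), \]

meaning that the third term above is nonpositive, and so

\[ (t_1+1)\left[ U_i(\hat a_i, a^0_{-i}) - U_i(a^0_i, a^0_{-i}) \right]
 + \sum_{\tau = t-\delta_i(t)}^{t-1} \left[ U_i(\hat a_i, a_{-i}(\tau)) - U_i(a^0_i, a_{-i}(\tau)) \right] \ge 0. \]

Since each term of the sum is at least $-M_u$, this implies that

\[ \left[ U_i(\hat a_i, a^0_{-i}) - U_i(a^0_i, a^0_{-i}) \right] \ge -\frac{M_u\, \delta_i(t)}{t_1+1} > -m_u, \]

or, alternatively,

\[ \left[ U_i(a^0_i, a^0_{-i}) - U_i(\hat a_i, a^0_{-i}) \right] < m_u. \]

If the quantity in brackets were positive, this would violate the definition of $m_u$,
unless $\hat a_i = a^0_i$. In either case,

\[ U_i(\hat a_i, a^0_{-i}) - U_i(a^0_i, a^0_{-i}) \ge 0. \]

□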

There are certain action profile/empirical frequency values where the next play is
“forced”. Define the time-dependent (forced-move) set $\mathcal{F}(t) \subset \mathcal{A} \times \Delta(\mathcal{A})$ as

\[ (a, z) \in \mathcal{F}(t) \;\Leftrightarrow\; a_i \in BR_i\!\left( \frac{t}{t+1}\, z_{-i} + \frac{1}{t+1}\, v^{a_{-i}} \right), \quad \forall i \in \{1, \ldots, n\}. \]

So the condition $(a(t), z(t)) \in \mathcal{F}(t)$ implies that for all $i$, “today’s” action necessarily
lies in “tomorrow’s” best response, i.e.,

\[ a_i(t) \in BR_i(z_{-i}(t+1)). \]

By the rule JSFP-1, the next play $a_i(t+1) = a_i(t)$ is forced for all $i \in \{1, \ldots, n\}$.

Now define

\[ \pi(t; a(t), z(t)) := \min\{\tau \ge 0 : (a(t+\tau), z(t+\tau)) \notin \mathcal{F}(t+\tau)\}. \tag{3.6} \]

If this is never satisfied, then set $\pi(t; a(t), z(t)) = \infty$.

For the sake of notational simplicity, we will drop the explicit dependence on $a(t)$
and $z(t)$ and simply write $\pi(t)$ instead of $\pi(t; a(t), z(t))$.

A consequence of the definition of $\pi(t)$ is that for a given $a(t)$ and $z(t)$, $a(t)$
must be repeated over the interval $[t, t+\pi(t)]$. Furthermore, at time $t+\pi(t)+1$, at
least one player can improve (over yet another repeated play of $a(t)$) by playing a best
response at time $t+\pi(t)+1$. Furthermore, the probability that exactly one player will
switch to a best response action at time $t+\pi(t)+1$ is at least $\epsilon(1-\bar\epsilon)^{n-1}$.

The following claim shows that this improvement opportunity remains even if $a(t)$
is repeated for longer than $\pi(t)$ (because of inertia).

Claim 3.6.2. Let $a(t)$ and $z(t)$ be such that $\pi(t) < \infty$. Let $t_1$ be any integer satisfying
$\pi(t) \le t_1 < \infty$. If

\[ a(t) = a(t+1) = \ldots = a(t+\pi(t)) = \ldots = a(t+t_1), \]

then

\[ a_i(t) \notin BR_i(z_{-i}(t+t_1+1)), \quad \text{for some } i \in \{1, \ldots, n\}. \]

Proof. Let $i \in \{1, \ldots, n\}$ be such that

\[ a_i(t) \notin BR_i(z_{-i}(t+\pi(t)+1)) \]

and

\[ a_i(t) \in BR_i(z_{-i}(t+\pi(t))). \]

The existence of such an $i$ is assured by the definition of $\pi(t)$. Pick $\hat a_i \in BR_i(z_{-i}(t+\pi(t)+1))$.
We have

\[ U_i(\hat a_i, z_{-i}(t+\pi(t)+1)) - U_i(a_i(t), z_{-i}(t+\pi(t)+1)) \]
\[ = \left[ U_i(\hat a_i, z_{-i}(t+\pi(t))) - U_i(a_i(t), z_{-i}(t+\pi(t))) \right] \frac{t+\pi(t)}{t+\pi(t)+1}
 + \left[ U_i(\hat a_i, a_{-i}(t)) - U_i(a_i(t), a_{-i}(t)) \right] \frac{1}{t+\pi(t)+1} > 0. \]

Since $a_i(t) \in BR_i(z_{-i}(t+\pi(t)))$, we must have

\[ U_i(\hat a_i, a_{-i}(t)) - U_i(a_i(t), a_{-i}(t)) > 0. \]

This implies

\[ U_i(\hat a_i, z_{-i}(t+t_1+1)) - U_i(a_i(t), z_{-i}(t+t_1+1)) \]
\[ = \left[ U_i(\hat a_i, z_{-i}(t+\pi(t)+1)) - U_i(a_i(t), z_{-i}(t+\pi(t)+1)) \right] \frac{t+\pi(t)+1}{t+t_1+1}
 + \left[ U_i(\hat a_i, a_{-i}(t)) - U_i(a_i(t), a_{-i}(t)) \right] \frac{t_1-\pi(t)}{t+t_1+1} > 0. \]

□


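Claim 3.6.3. If, at any time, $a(t)$ is not an equilibrium, then $\pi(t) \le \gamma t$.

Proof. Let $a^0 := a(t)$. Since $a^0$ is not an equilibrium,

\[ a^0_i \notin BR_i(a^0_{-i}) \quad \text{for some } i \in \{1, \ldots, n\}. \]

Pick $\hat a_i \in BR_i(a^0_{-i})$ so that $U_i(\hat a_i, a^0_{-i}) - U_i(a^0_i, a^0_{-i}) \ge m_u$. If

\[ a(t) = a(t+1) = \ldots = a(t+\gamma t) = a^0, \]

then

\[ U_i(\hat a_i, z_{-i}(t+\gamma t+1)) - U_i(a^0_i, z_{-i}(t+\gamma t+1)) \]
\[ = \frac{ t\left[ U_i(\hat a_i, z_{-i}(t)) - U_i(a^0_i, z_{-i}(t)) \right] + (\gamma t+1)\left[ U_i(\hat a_i, a^0_{-i}) - U_i(a^0_i, a^0_{-i}) \right] }{ t+\gamma t+1 } \]
\[ \ge \frac{ -t M_u + (\gamma t+1)\, m_u }{ t+\gamma t+1 } \]
\[ > 0. \]

□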

Claim 3.6.4. Consider a finite generalized ordinal potential game with a potential
function $\phi(\cdot)$ with player utilities satisfying Assumption 3.2.2. For any time $t \ge 0$,
suppose that

1. $a(t)$ is not an equilibrium; and

2. $\max_{1 \le i \le n} \delta_i(t) \le \delta$ for some $\delta \le t$.

Define

\[ \psi(t) := 1 + \max\{\pi(t), \gamma\delta\}. \]

Then $\psi(t) \le 1 + \gamma t$ and

\[ \Pr\left[ \phi(a(t+\psi(t))) > \phi(a(t)) \mid a(t), z(t) \right] \ge \epsilon(1-\bar\epsilon)^{n(1+\gamma\delta)-1}, \]

and

\[ \max_{1 \le i \le n} \delta_i(t+\psi(t)) \le 1 + (1+\gamma)\delta. \]

Proof. Since $a(t)$ is not an equilibrium, Claim 3.6.3 implies that $\pi(t) \le \gamma t$, which in
turn implies the above upper bound on $\psi(t)$.

First consider the case where $\pi(t) \ge \gamma\delta$, i.e., $\psi(t) = 1 + \pi(t)$. According to the
definition of $\pi(t)$ in equation (3.6), $a(t)$ must be repeated as a best response in the
period $[t, t+\pi(t)]$. Furthermore, we must have

\[ \max_{1 \le i \le n} \delta_i(t+\psi(t)) \le 1 \]

and $a_i(t) \notin BR_i(z_{-i}(t+\psi(t)))$ for at least one player $\mathcal{P}_i$. The probability that exactly
one such player $\mathcal{P}_i$ will switch to a choice different than $a_i(t)$ at time $t+\psi(t)$ is at
least $\epsilon(1-\bar\epsilon)^{n-1}$. But, by Claim 3.6.1 and no-indifference Assumption 3.2.2, such an
event would cause

\[ U_i(a(t+\pi(t)+1)) > U_i(a(t)) \;\Rightarrow\; \phi(a(t+\pi(t)+1)) > \phi(a(t)). \]

Now consider the case where $\pi(t) < \gamma\delta$, i.e., $\psi(t) = 1 + \gamma\delta$. In this case,

\[ \max_{1 \le i \le n} \delta_i(t+\psi(t)) \le 1 + \gamma\delta + \delta. \]

Moreover, the event

\[ a(t) = \ldots = a(t+\gamma\delta) \]

will occur with probability at least⁸ $(1-\bar\epsilon)^{n\gamma\delta}$. Conditioned on this event, Claim 3.6.2
provides that exactly one player $\mathcal{P}_i$ will switch to a choice different than $a_i(t)$ at time
$t+\psi(t)$ with probability at least $\epsilon(1-\bar\epsilon)^{n-1}$. By Claim 3.6.1 and no-indifference
Assumption 3.2.2, this would cause

\[ U_i(a(t+\psi(t))) > U_i(a(t)) \;\Rightarrow\; \phi(a(t+\psi(t))) > \phi(a(t)). \]

□


Proof of Theorem 3.2.1

It suffices to show that there exists a non-zero probability, $\epsilon^* > 0$, such that the following
statement holds. For any $t \ge 0$, $a(t) \in \mathcal{A}$, and $z(t) \in \Delta(\mathcal{A})$, there exists a finite
time $t^* \ge t$ such that, for some equilibrium $a^*$,

\[ \Pr\left[ a(\tau) = a^*,\ \forall \tau \ge t^* \mid a(t), \{z_{-i}(t)\}_{i=1}^n \right] \ge \epsilon^*. \tag{3.7} \]

In other words, the probability of convergence to an equilibrium by time $t^*$ is at least
$\epsilon^*$. Since $\epsilon^*$ does not depend on $t$, $a(t)$, or $z(t)$, this will imply that the action profile
converges to an equilibrium almost surely.

We will construct a series of events that can occur with positive probability to
establish the bound in equation (3.7).

⁸ In fact, a tighter bound can be derived by exploiting the forced moves for a duration of $\pi(t)$.

Let $t_0 = t + 1$. All players will play a best response at time $t_0$ with probability at
least $\epsilon^n$. Therefore, we have

\[ \Pr\left[ \max_{1 \le i \le n} \delta_i(t_0) = 0 \mid a(t), \{z_{-i}(t)\}_{i=1}^n \right] \ge \epsilon^n. \tag{3.8} \]

Assume that $a(t_0)$ is not an equilibrium. Otherwise, according to Proposition 3.2.2,
$a(\tau) = a(t_0)$ for all $\tau \ge t_0$.

From Claim 3.6.4, define $t_1$ and $\delta_1$ as

\[ \delta_1 := 1 + (1+\gamma)\delta_0, \]
\[ t_1 := t_0 + 1 + \max\{\pi(t_0), \gamma\delta_0\} \le t_0 + 1 + \gamma t_0 = 1 + (1+\gamma)t_0, \]

where $\delta_0 := 0$. By Claim 3.6.4,

\[ \Pr\left[ \phi(a(t_1)) > \phi(a(t_0)) \mid a(t_0), \{z_{-i}(t_0)\}_{i=1}^n \right] \ge \epsilon(1-\bar\epsilon)^{n(1+\gamma\delta_0)-1} \]

and

\[ \max_{1 \le i \le n} \delta_i(t_1) \le \delta_1. \]

Similarly, for $k > 0$ we can recursively define

\[ \delta_k := 1 + (1+\gamma)\delta_{k-1} = (1+\gamma)^k \delta_0 + \sum_{j=0}^{k-1} (1+\gamma)^j = \sum_{j=0}^{k-1} (1+\gamma)^j, \]

and

\[ t_k := t_{k-1} + 1 + \max\{\pi(t_{k-1}), \gamma\delta_{k-1}\} \le 1 + (1+\gamma)t_{k-1} \le (1+\gamma)^k t_0 + \sum_{j=0}^{k-1} (1+\gamma)^j, \]

where

\[ \Pr\left[ \phi(a(t_k)) > \phi(a(t_{k-1})) \mid a(t_{k-1}), \{z_{-i}(t_{k-1})\}_{i=1}^n \right] \ge \epsilon(1-\bar\epsilon)^{n(1+\gamma\delta_{k-1})-1} \]

and

\[ \max_{1 \le i \le n} \delta_i(t_k) \le \delta_k, \]

as long as $a(t_{k-1})$ is not an equilibrium.

Therefore, one can construct a sequence of profiles $a(t_0), a(t_1), \ldots, a(t_k)$ with the
property that $\phi(a(t_0)) < \phi(a(t_1)) < \ldots < \phi(a(t_k))$. Since in a finite generalized
ordinal potential game, $\phi(a(t_k))$ cannot increase indefinitely as $k$ increases, we must
have

\[ \Pr\left[ a(t_k) \text{ is an equilibrium for some } t_k \in [t, \infty) \mid a(t), \{z_{-i}(t)\}_{i=1}^n \right] \ge \epsilon^n \prod_{k=0}^{|\mathcal{A}|-1} \epsilon(1-\bar\epsilon)^{n(1+\gamma\delta_k)-1}, \]

where $\epsilon^n$ comes from (3.8). Finally, from Claim 3.6.1 and Assumption 3.2.2, the
above inequality together with

\[ \Pr\left[ a(t_k) = \ldots = a(t_k + \gamma\delta_k) \mid a(t_k), \{z_{-i}(t_k)\}_{i=1}^n \right] \ge (1-\bar\epsilon)^{n\gamma\delta_k} \ge (1-\bar\epsilon)^{n\gamma\delta_{|\mathcal{A}|}} \]

implies that for some equilibrium, $a^*$,

\[ \Pr\left[ a(\tau) = a^*,\ \forall \tau \ge t^* \mid a(t), \{z_{-i}(t)\}_{i=1}^n \right] \ge \epsilon^*, \]

where

\[ t^* = t_{|\mathcal{A}|} + \gamma\delta_{|\mathcal{A}|} + 1 = (1+\gamma)^{|\mathcal{A}|} t_0 + \sum_{j=0}^{|\mathcal{A}|} (1+\gamma)^j, \]
\[ \epsilon^* = \epsilon^n \left( \prod_{k=0}^{|\mathcal{A}|-1} \epsilon(1-\bar\epsilon)^{n(1+\gamma\delta_k)-1} \right) (1-\bar\epsilon)^{n\gamma\delta_{|\mathcal{A}|}}. \]

Since $\epsilon^*$ does not depend on $t$, this concludes the proof. □

CHAPTER 4


Regret  Based  Dynamics  for  Weakly  Acyclic  Games 

No-regret algorithms have been proposed to control a wide variety of multi-agent systems.
The appeal of no-regret algorithms is that they are easily implementable in large
scale multi-agent systems because players make decisions using only retrospective or
“regret based” information. Furthermore, there are existing results proving that the
collective behavior will asymptotically converge to a set of points of “no-regret” in any
game. We illustrate, through a simple example, that no-regret points need not reflect
desirable operating conditions for a multi-agent system. Multi-agent systems often
exhibit an additional structure (i.e. being “weakly acyclic”) that has not been exploited
in the context of no-regret algorithms. In this chapter, we introduce a modification of
the traditional no-regret algorithms by (i) exponentially discounting the memory and
(ii) bringing in a notion of inertia in players’ decision process. We show how these
modifications can lead to an entire class of regret based algorithms that provide almost
sure convergence to a pure Nash equilibrium in any weakly acyclic game.

4.1  Introduction 

The applicability of regret based algorithms for multi-agent learning has been studied
in several papers [Gor05, Bow04, KV05, BP05, GJ03, AMS07]. The appeal of
regret based algorithms is twofold. First of all, regret based algorithms are easily
implementable in large scale multi-agent systems when compared with other learning
algorithms such as fictitious play [MS96a, JGD01]. Secondly, there is a wide range of
algorithms, called “no-regret” algorithms, that guarantee that the collective behavior
will asymptotically converge to a set of points of no-regret (also referred to as coarse
correlated equilibrium) in any game [You05]. A point of no-regret characterizes a
situation for which the average utility that a player actually received is as high as the
average utility that the player “would have” received had that player used a different
fixed strategy at all previous time steps. No-regret algorithms have been proposed in
a variety of settings ranging from network routing problems [BEL06] to structured
prediction problems [Gor05].

In the more general regret based algorithms, each player makes a decision using
only information regarding the regret for each of his possible actions. If an algorithm
guarantees that a player’s maximum regret asymptotically approaches zero, then the
algorithm is referred to as a no-regret algorithm. The most common no-regret algorithm
is regret matching [HM00]. In regret matching, at each time step, each player plays a
strategy where the probability of playing an action is proportional to the positive part
of his regret for that action. In a multi-agent system, if all players adhere to a no-regret
learning algorithm, such as regret matching, then the group behavior will converge
asymptotically to a set of points of no-regret [HM00]. Traditionally, a point of
no-regret has been viewed as a desirable or efficient operating condition because each
player’s average utility is as good as the average utility that any other action would
have yielded [KV05]. However, a point of no-regret says little about performance;
hence knowing that the collective behavior of a multi-agent system will converge to a
set of points of no-regret does not, in general, guarantee efficient operation.

There have been attempts to further strengthen the convergence results of no-regret
algorithms for special classes of games. For example, in [JGD01], Jafari et al. showed
through simulations that no-regret algorithms provide convergence to a Nash
equilibrium in dominance solvable, constant-sum, and general sum 2x2 games. In [Bow04],
Bowling introduced a gradient based regret algorithm that guarantees that players’
strategies converge to a Nash equilibrium in any 2 player 2 action repeated game.
In [BEL06], Blum et al. analyzed the convergence of no-regret algorithms in routing
games and proved that behavior will approach a Nash equilibrium in various settings.
However, the classes of games considered in these works cannot fully model a wide
variety of multi-agent systems.

It turns out that weakly acyclic games, which generalize potential games [MS96b],
are closely related to multi-agent systems [MAS07a]. The connection can be seen by
recognizing that in any multi-agent system there is a global objective. Each player
is assigned a local utility function that is appropriately aligned with the global
objective. It is precisely this alignment that connects the realms of multi-agent systems and
weakly acyclic games.

An open question is whether no-regret algorithms converge to a Nash equilibrium
in n-player weakly acyclic games. In this chapter, we introduce a modification of the
traditional no-regret algorithms that (i) exponentially discounts the memory and (ii)
brings in a notion of inertia in players’ decision process. We show how these
modifications can lead to an entire class of regret based algorithms that provide almost sure
convergence to a pure Nash equilibrium in any weakly acyclic game. It is important
to note that convergence to a Nash equilibrium also implies convergence to a no-regret
point.

In  Section  4.2  we  discuss  the  no-regret  algorithm,  “regret  matching,”  and  illustrate 
the  performance  issues  involved  with  no-regret  points  in  a  simple  3  player  identical 
interest  game.  In  Section  4.3  we  introduce  a  new  class  of  learning  dynamics  referred 
to  as  regret  based  dynamics  with  fading  memory  and  inertia.  In  Section  4.4  we  present 
some  simulation  results.  Section  4.5  presents  some  concluding  remarks. 
4.2  Regret  Matching


We consider a repeated matrix game with $n$-player set $\mathcal{P} := \{\mathcal{P}_1, \ldots, \mathcal{P}_n\}$, a finite
action set $\mathcal{A}_i$ for each player $\mathcal{P}_i \in \mathcal{P}$, and a utility function $U_i : \mathcal{A} \to \mathbb{R}$ for each
player $\mathcal{P}_i \in \mathcal{P}$, where $\mathcal{A} := \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$.

We introduce regret matching, from [HM00], in which players choose their actions
based on their regret for not choosing particular actions in the past steps.

Define the average regret of player $\mathcal{P}_i$ for an action $a_i \in \mathcal{A}_i$ at time $t$ as

$$R_i^{a_i}(t) := \frac{1}{t} \sum_{\tau=0}^{t-1} \left( U_i(a_i, a_{-i}(\tau)) - U_i(a(\tau)) \right). \tag{4.1}$$

In other words, player $\mathcal{P}_i$’s average regret for $a_i \in \mathcal{A}_i$ would represent the average
improvement in his utility if he had chosen $a_i \in \mathcal{A}_i$ in all past steps and all other
players’ actions had remained unaltered.

Each player $\mathcal{P}_i$ using regret matching computes $R_i^{a_i}(t)$ for every action $a_i \in \mathcal{A}_i$
using the recursion

$$R_i^{a_i}(t) = \frac{t-1}{t} R_i^{a_i}(t-1) + \frac{1}{t} \left( U_i(a_i, a_{-i}(t-1)) - U_i(a(t-1)) \right).$$

Note that, at every step $t > 0$, player $\mathcal{P}_i$ updates all entries in his average regret
vector $R_i(t) := [R_i^{a_i}(t)]_{a_i \in \mathcal{A}_i}$. To update his average regret vector at time $t$, it is
sufficient for player $\mathcal{P}_i$ to observe (in addition to the actual utility received at time
$t-1$, $U_i(a(t-1))$) his hypothetical utilities $U_i(a_i, a_{-i}(t-1))$, for all $a_i \in \mathcal{A}_i$, that
would have been received if he had chosen $a_i$ (instead of $a_i(t-1)$) and all other player
actions $a_{-i}(t-1)$ had remained unchanged at step $t-1$.

In regret matching, once player $\mathcal{P}_i$ computes his average regret vector, $R_i(t)$, he
chooses an action $a_i(t)$, $t > 0$, according to the probability distribution $p_i(t)$ defined
as

$$p_i^{a_i}(t) = \frac{[R_i^{a_i}(t)]^+}{\sum_{\tilde{a}_i \in \mathcal{A}_i} [R_i^{\tilde{a}_i}(t)]^+},$$

for any $a_i \in \mathcal{A}_i$, provided that the denominator above is positive; otherwise, $p_i(t)$ is the
uniform distribution over $\mathcal{A}_i$ ($p_i(0) \in \Delta(\mathcal{A}_i)$ is always arbitrary). Here $[\cdot]^+$ denotes the
positive part. Roughly speaking, a player using regret matching chooses a particular action
at any step with probability proportional to the positive part of the average regret for not
choosing that particular action in the past steps. If all players use regret matching, the
empirical distribution of the joint actions converges almost surely to the set of coarse
correlated equilibria (similar results hold for different regret based adaptive dynamics);
see [HM00, HM01, HM03a]. Note that this does not mean that the action profiles $a(t)$ will
converge, nor does it mean that the empirical frequencies of $a(t)$ will converge to a point
in $\Delta(\mathcal{A})$.
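The update and selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the cited references: the function names, the dictionary representation of the regret vector, and the argument `utility_of` (holding the hypothetical utilities $U_i(a_i, a_{-i}(t-1))$ for each own action) are assumptions introduced here.

```python
import random


def update_average_regret(regret, t, utility_of, played):
    """One step of the average-regret recursion for a single player.

    regret     : dict action -> R_i^{a_i}(t-1)
    t          : current step, t >= 1
    utility_of : dict action -> U_i(a_i, a_{-i}(t-1)) (hypothetical utilities)
    played     : the action a_i(t-1) that was actually played
    """
    received = utility_of[played]
    return {a: ((t - 1) / t) * r + (utility_of[a] - received) / t
            for a, r in regret.items()}


def regret_matching_distribution(regret):
    """Probabilities proportional to the positive part of each regret.

    Falls back to the uniform distribution when no regret is positive.
    """
    positive = {a: max(r, 0.0) for a, r in regret.items()}
    total = sum(positive.values())
    if total <= 0.0:
        return {a: 1.0 / len(regret) for a in regret}
    return {a: p / total for a, p in positive.items()}


def sample_action(distribution, rng=random):
    """Draw one action from a dict action -> probability."""
    actions = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Each player only needs his own hypothetical utilities from the previous step, which is the informational requirement described in the text.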

4.2.1  Coarse  Correlated  Equilibria  and  No-Regret 

The set of coarse correlated equilibria has a strong connection to the notion of regret.

We will restate the definitions of the joint and marginal empirical frequencies
originally defined in Section 3.2. Define the empirical frequency of the joint actions, $z^a(t)$,
as the percentage of stages at which all players chose the joint action profile $a \in \mathcal{A}$ up
to time $t-1$, i.e.,

$$z^a(t) := \frac{1}{t} \sum_{\tau=0}^{t-1} I\{a(\tau) = a\},$$

where $I\{\cdot\}$ is the indicator function. Let $z(t)$ denote the empirical frequency vector formed
by the components $\{z^a(t)\}_{a \in \mathcal{A}}$. Note that the dimension of $z(t)$ is the cardinality of the
set $\mathcal{A}$, i.e., $|\mathcal{A}|$, and $z(t) \in \Delta(\mathcal{A})$.

Similarly, let $z_{-i}^{a_{-i}}(t)$ be the percentage of stages at which players other than player
$\mathcal{P}_i$ have chosen the joint action profile $a_{-i} \in \mathcal{A}_{-i}$ up to time $t-1$, i.e.,

$$z_{-i}^{a_{-i}}(t) := \frac{1}{t} \sum_{\tau=0}^{t-1} I\{a_{-i}(\tau) = a_{-i}\}, \tag{4.2}$$

which, given $z(t)$, can also be expressed as

$$z_{-i}^{a_{-i}}(t) = \sum_{a_i \in \mathcal{A}_i} z^{(a_i, a_{-i})}(t).$$

Let $z_{-i}(t)$ denote the empirical frequency vector formed by the components
$\{z_{-i}^{a_{-i}}(t)\}_{a_{-i} \in \mathcal{A}_{-i}}$. Note that the dimension of $z_{-i}(t)$ is the cardinality
$|\mathcal{A}_{-i}|$ and $z_{-i}(t) \in \Delta(\mathcal{A}_{-i})$.

Given a joint distribution $z(t)$, the expected utility of player $\mathcal{P}_i$ is

$$U_i(z(t)) = \sum_{a \in \mathcal{A}} U_i(a)\, z^a(t) = \frac{1}{t} \sum_{\tau=0}^{t-1} U_i(a(\tau)),$$

which is precisely the average utility that player $\mathcal{P}_i$ has received up to time $t-1$. The
expected utility of player $\mathcal{P}_i$ for any action $a_i \in \mathcal{A}_i$ is

$$U_i(a_i, z_{-i}(t)) = \sum_{a_{-i} \in \mathcal{A}_{-i}} U_i(a_i, a_{-i})\, z_{-i}^{a_{-i}}(t) = \frac{1}{t} \sum_{\tau=0}^{t-1} U_i(a_i, a_{-i}(\tau)),$$

which is precisely the average utility that player $\mathcal{P}_i$ would have received up to time
$t-1$ if player $\mathcal{P}_i$ had played action $a_i$ at all previous time periods, provided that the other
players’ actions remained unchanged. Therefore, the regret of player $\mathcal{P}_i$ for action
$a_i \in \mathcal{A}_i$ at time $t$ can be expressed as

$$R_i^{a_i}(t) = U_i(a_i, z_{-i}(t)) - U_i(z(t)).$$


If all players use regret matching, then we know that the empirical frequency $z(t)$
of the joint actions converges almost surely to the set of coarse correlated equilibria. If
$z(t)$ is a coarse correlated equilibrium, then we know that for any player $\mathcal{P}_i \in \mathcal{P}$ and
any action $a_i \in \mathcal{A}_i$,

$$U_i(a_i, z_{-i}(t)) \leq U_i(z(t)) \iff R_i^{a_i}(t) \leq 0.$$

Therefore, stating that the empirical frequency of the joint actions converges to the set
of coarse correlated equilibria is equivalent to saying that the positive part of each player’s
average regret for any action will asymptotically vanish.
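The identity $R_i^{a_i}(t) = U_i(a_i, z_{-i}(t)) - U_i(z(t))$ can be checked numerically on a short play history. The sketch below is illustrative only: the two-player payoff table `PAYOFF`, the history, and the function names are hypothetical and are not taken from the game in Figure 4.1.

```python
from collections import Counter

# Hypothetical 2-player identical-interest payoffs (not the game of Figure 4.1).
PAYOFF = {("U", "L"): 2.0, ("U", "R"): 0.0, ("D", "L"): 0.0, ("D", "R"): 1.0}


def empirical_frequency(history):
    """z^a(t): fraction of stages at which the joint action a was played."""
    t = len(history)
    return {a: count / t for a, count in Counter(history).items()}


def regret_from_frequency(payoff, history, i, a_i):
    """Regret of player i for action a_i, computed from z(t)."""
    z = empirical_frequency(history)
    average = sum(payoff[a] * p for a, p in z.items())
    # Replacing player i's component and weighting by z^a(t) sums over the
    # marginal z_{-i}(t), since z_{-i}^{a_-i}(t) = sum over a_i of z^{(a_i, a_-i)}(t).
    hypothetical = sum(payoff[a[:i] + (a_i,) + a[i + 1:]] * p for a, p in z.items())
    return hypothetical - average


def regret_direct(payoff, history, i, a_i):
    """Regret of player i for action a_i, computed from definition (4.1)."""
    diffs = [payoff[a[:i] + (a_i,) + a[i + 1:]] - payoff[a] for a in history]
    return sum(diffs) / len(history)


history = [("U", "L"), ("U", "R"), ("D", "R"), ("U", "R")]
for action in ("U", "D"):
    print(action, regret_from_frequency(PAYOFF, history, 0, action),
          regret_direct(PAYOFF, history, 0, action))
```

Both columns agree for every action, which is the content of the identity.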

4.2.2  Illustrative  Example 

In general, the set of Nash equilibria is a proper subset of the set of coarse correlated
equilibria. Consider for example the following 3-player identical interest game
characterized by the player utilities shown in Figure 4.1.

Figure 4.1: A 3-player Identical Interest Game.


Player $\mathcal{P}_1$ chooses a row $U$ or $D$, Player $\mathcal{P}_2$ chooses a column $L$ or $R$, and Player $\mathcal{P}_3$
chooses a matrix $M_1$, $M_2$, or $M_3$. There are two pure Nash equilibria, $(U, L, M_1)$
and $(D, R, M_3)$, both of which yield maximum utility 2 to all players. The set of coarse
correlated equilibria contains these two pure Nash equilibria as extreme points
of $\Delta(\mathcal{A})$ as well as many other probability distributions in $\Delta(\mathcal{A})$. In particular, the set
of coarse correlated equilibria contains the following:

$$\left\{ z \in \Delta(\mathcal{A}) : \sum_{a \in \mathcal{A} : a_3 = M_2} z^a = 1, \quad z^{ULM_2} = z^{DRM_2}, \quad z^{URM_2} = z^{DLM_2} \right\}.$$

Any coarse correlated equilibrium of this form yields an expected utility of 0 to all
players. Clearly, one of the two pure Nash equilibria would be more desirable to all
players than any other outcome, including the above coarse correlated equilibria.
However, the existing results at the time of writing this dissertation, such as Theorem 3.1 in
[You05], only guarantee that regret matching will lead players to the set of coarse
correlated equilibria and not necessarily to a pure Nash equilibrium. While this example
is simplistic in nature, it is reasonable to expect that similar situations could arise in
more general weakly acyclic games.

We should emphasize that regret matching could indeed be convergent to a pure
Nash equilibrium in weakly acyclic games; however, to the best of the authors’ knowledge,
no proof for such a statement exists. The existing results characterize the long-term
behavior of regret matching in general games as convergence to the set of coarse
correlated equilibria, whereas we are interested in proving that the action profiles, $a(t)$,
generated by regret based dynamics will converge to a pure Nash equilibrium when player
utilities constitute a weakly acyclic game. We pursue this objective, for a modified
class of regret based dynamics, in the next section.

4.3  Regret  Based  Dynamics  with  Fading  Memory  and  Inertia 

To enable convergence to a pure Nash equilibrium in weakly acyclic games, we will
modify the conventional regret based dynamics in two ways. First, we will assume
that each player has a fading memory, that is, each player exponentially discounts
the influence of its past regret in the computation of its average regret vector. More
precisely, each player computes a discounted average regret vector according to the
recursion

$$\tilde{R}_i^{a_i}(t+1) = (1-\rho)\, \tilde{R}_i^{a_i}(t) + \rho \left( U_i(a_i, a_{-i}(t)) - U_i(a(t)) \right),$$

for all $a_i \in \mathcal{A}_i$, where $\rho \in (0, 1]$ is a parameter with $1-\rho$ being the discount factor,
and $\tilde{R}_i^{a_i}(1) = 0$.
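As a small illustration of the fading-memory recursion, the following Python sketch performs one discounted update for a single player. The function name and the dictionary layout are assumptions made here for illustration; `utility_of[a]` stands for the hypothetical utility $U_i(a_i, a_{-i}(t))$.

```python
def update_discounted_regret(regret, rho, utility_of, played):
    """One step of the discounted (fading memory) regret recursion.

    regret     : dict action -> discounted regret at step t
    rho        : parameter in (0, 1]; 1 - rho is the discount factor
    utility_of : dict action -> U_i(a_i, a_{-i}(t)) (hypothetical utilities)
    played     : the action a_i(t) actually played at step t
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    received = utility_of[played]
    return {a: (1.0 - rho) * r + rho * (utility_of[a] - received)
            for a, r in regret.items()}


# Starting from zero regret, as in the initial condition of the recursion.
regret = {"A": 0.0, "B": 0.0}
regret = update_discounted_regret(regret, 0.1, {"A": 1.0, "B": 3.0}, "A")
print(regret)
```

Unlike the running average of Section 4.2, the weight on the newest observation stays at $\rho$ instead of shrinking like $1/t$, so old observations are forgotten geometrically.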
Second, we will assume that each player chooses an action based on its discounted
average regret using some inertia. Therefore, each player $\mathcal{P}_i$ chooses an action $a_i(t)$,
at step $t > 1$, according to the probability distribution

$$\alpha_i(t)\, RB_i(\tilde{R}_i(t)) + (1 - \alpha_i(t))\, v^{a_i(t-1)},$$

where $\alpha_i(t)$ is a parameter representing player $\mathcal{P}_i$’s willingness to optimize at time
$t$, $v^{a_i(t-1)}$ is the vertex of $\Delta(\mathcal{A}_i)$ corresponding to the action $a_i(t-1)$ chosen by
player $\mathcal{P}_i$ at step $t-1$, and $RB_i : \mathbb{R}^{|\mathcal{A}_i|} \to \Delta(\mathcal{A}_i)$ is any continuous function (on
$\{x \in \mathbb{R}^{|\mathcal{A}_i|} : [x]^+ \neq 0\}$) satisfying

$$x^\ell > 0 \iff RB_i^\ell(x) > 0 \quad \text{and} \quad [x]^+ = 0 \implies RB_i^\ell(x) = \frac{1}{|\mathcal{A}_i|}, \ \forall \ell, \tag{4.3}$$

where $x^\ell$ and $RB_i^\ell(x)$ are the $\ell$-th components of $x$ and $RB_i(x)$ respectively.

We will call the above dynamics regret based dynamics (RB) with fading memory
and inertia. One particular choice for the function $RB_i$ is

$$RB_i^\ell(x) = \frac{[x^\ell]^+}{\sum_{m=1}^{|\mathcal{A}_i|} [x^m]^+}, \quad (\text{when } [x]^+ \neq 0), \tag{4.4}$$

which leads to regret matching with fading memory and inertia. Another particular
choice is

$$RB_i^\ell(x) = \frac{e^{\frac{1}{\tau} x^\ell}}{\sum_{x^m > 0} e^{\frac{1}{\tau} x^m}}\, I\{x^\ell > 0\}, \quad (\text{when } [x]^+ \neq 0),$$

where $\tau > 0$ is a parameter. Note that, for small values of $\tau$, player $\mathcal{P}_i$ would choose,
with high probability, the action corresponding to the maximum regret. This choice
leads to a stochastic variant of an algorithm called Joint Strategy Fictitious Play with
fading memory and inertia; see Section 3.3. Also, note that, for large values of $\tau$,
player $\mathcal{P}_i$ would choose any action having positive regret with approximately equal probability.
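The two choices of $RB_i$ and the inertia mixture can be sketched as follows. This is a minimal illustration under stated assumptions: actions are represented by list indices, the function names are hypothetical, and drawing "stay" with probability $1 - \alpha_i(t)$ before sampling from $RB_i$ is one way of sampling from the mixture $\alpha_i(t)\, RB_i(\cdot) + (1-\alpha_i(t))\, v^{a_i(t-1)}$.

```python
import math
import random


def rb_regret_matching(x):
    """Choice (4.4): probabilities proportional to positive parts of regrets."""
    positive = [max(v, 0.0) for v in x]
    total = sum(positive)
    if total == 0.0:
        return [1.0 / len(x)] * len(x)  # uniform when no regret is positive
    return [p / total for p in positive]


def rb_softmax_positive(x, tau):
    """Exponential weights restricted to actions with positive regret."""
    if all(v <= 0.0 for v in x):
        return [1.0 / len(x)] * len(x)
    peak = max(x)  # subtracting the maximum avoids overflow for small tau
    weights = [math.exp((v - peak) / tau) if v > 0.0 else 0.0 for v in x]
    total = sum(weights)
    return [w / total for w in weights]


def choose_with_inertia(x, previous, alpha, rb, rng=random):
    """Repeat `previous` with probability 1 - alpha, else sample from rb(x)."""
    if rng.random() >= alpha:
        return previous
    return rng.choices(range(len(x)), weights=rb(x), k=1)[0]


regrets = [1.0, -2.0, 3.0]
print(rb_regret_matching(regrets))          # [0.25, 0.0, 0.75]
print(rb_softmax_positive(regrets, 0.05))   # mass concentrates on index 2
print(choose_with_inertia(regrets, 0, 0.5, rb_regret_matching))
```

Both functions assign zero probability to actions with nonpositive regret whenever some regret is positive, which is the property required by (4.3).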
According to these rules, player $\mathcal{P}_i$ will stay with his previous action $a_i(t-1)$
with probability $1 - \alpha_i(t)$ regardless of his regret. We make the following standing
assumption on the players’ willingness to optimize.

Assumption 4.3.1. There exist constants $\underline{\epsilon}$ and $\bar{\epsilon}$ such that

$$0 < \underline{\epsilon} < \alpha_i(t) < \bar{\epsilon} < 1$$

for all steps $t > 1$ and for all $i \in \{1, \ldots, n\}$.

This assumption implies that players are always willing to optimize with some
nonzero inertia¹. A motivation for the use of inertia is to instill a degree of hesitation
into the decision making process to ensure that players do not overreact to various
situations. We will assume that no player is indifferent between distinct strategies².

Assumption 4.3.2. Player utilities satisfy

$$U_i(a_i^1, a_{-i}) \neq U_i(a_i^2, a_{-i}), \quad \forall\, a_i^1, a_i^2 \in \mathcal{A}_i,\ a_i^1 \neq a_i^2,\ \forall\, a_{-i} \in \mathcal{A}_{-i},\ \forall\, i \in \{1, \ldots, n\}.$$

The  following  theorem  establishes  the  convergence  of  regret  based  dynamics  with 
fading  memory  and  inertia  to  a  pure  Nash  equilibrium. 

Theorem 4.3.1. In any weakly acyclic game satisfying Assumption 4.3.2, the action
profiles $a(t)$ generated by regret based dynamics with fading memory and inertia
satisfying Assumption 4.3.1 converge to a pure Nash equilibrium almost surely.

We provide a complete proof for the above result in the Appendix of this chapter.
We note that, in contrast to the existing weak convergence results for regret matching
in general games, the above result characterizes the long-term behavior of regret based
dynamics with fading memory and inertia, in a strong sense, albeit in a restricted class
of games. We next numerically verify our theoretical result through some simulations.

¹ This assumption can be relaxed to holding for sufficiently large $t$, as opposed to all $t$.

² One could alternatively assume that all pure Nash equilibria are strict.
4.4  Simulations


4.4.1  Three  Player  Identical  Interest  Game

We extensively simulated the RB iterations for the game considered in Figure 4.1. We
used the $RB_i$ function given in (4.4) with inertia factor $\alpha = 0.5$ and discount factor
$\rho = 0.1$. In all cases, player action profiles $a(t)$ converged to one of the pure Nash
equilibria as predicted by our main theoretical result. A typical simulation run shown
in Figure 4.2 illustrates the convergence of RB iterations to the pure Nash equilibrium
$(D, R, M_3)$.
Figure 4.2: Evolution of the actions of players using RB. (Horizontal axis: time step $t$, from 0 to 300; the panel for player 3 plots $a_3(t) \in \{M_1, M_2, M_3\}$.)
4.4.2  Distributed  Traffic  Routing


We consider a simple congestion game, as defined in Section 2.3.3, with 100 players
seeking to traverse from node A to node B along 5 different parallel roads as illustrated
in Figure 4.3. Each player can select any road as a possible route. In terms of congestion
games, the set of resources is the set of roads, $\mathcal{R}$, and each player can select one
road, i.e., $\mathcal{A}_i = \mathcal{R}$.

Figure 4.3: Regret Based Dynamics with Inertia: Congestion Game Example - Network Topology

We will assume that each road has a linear cost function with positive (randomly
chosen) coefficients,

$$c_{r_i}(k) = a_i k + b_i, \quad i = 1, \ldots, 5,$$

where $k$ represents the number of vehicles on that particular road. This cost function
may represent the delay incurred by a driver as a function of the number of other drivers
sharing the same road. The actual coefficients or structural form of the cost function
are unimportant as we are just using this example as an opportunity to illustrate the
convergence properties of the proposed regret based algorithms.

We simulated a case where drivers choose their initial routes randomly and, every
day thereafter, adjust their routes using the regret based dynamics with the $RB_i$
function given in (4.4) with inertia factor $\alpha = 0.85$ and discount factor $\rho = 0.1$. The
number of vehicles on each road fluctuates initially and then stabilizes as illustrated in
Figure 4.4. Figure 4.5 illustrates the evolution of the congestion cost on each road. One
can observe that the congestion cost on each road converges approximately to the same
value, which is consistent with a Nash equilibrium with a large number of drivers. This
behavior resembles an approximate “Wardrop equilibrium” [War52], which represents
a steady-state situation in which the congestion cost on each road is equal due to the
fact that, as the number of drivers increases, the effect of an individual driver on the
traffic conditions becomes negligible.
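A minimal Python sketch of this kind of experiment follows. It is not the code used to produce Figures 4.4 and 4.5: the coefficient ranges, the random seed, the function name `simulate_congestion`, and the choice of utility as the negative of the road cost are all assumptions made for illustration. A driver's hypothetical utility for another road is evaluated with that road's load increased by one, reflecting a unilateral switch.

```python
import random


def simulate_congestion(n_players=100, n_roads=5, alpha=0.85, rho=0.1,
                        steps=300, seed=0):
    """Regret matching with fading memory and inertia on parallel roads."""
    rng = random.Random(seed)
    # Hypothetical positive coefficients of the linear costs c_r(k) = a_r k + b_r.
    slope = [rng.uniform(0.5, 2.0) for _ in range(n_roads)]
    offset = [rng.uniform(0.0, 10.0) for _ in range(n_roads)]

    def cost(road, vehicles):
        return slope[road] * vehicles + offset[road]

    actions = [rng.randrange(n_roads) for _ in range(n_players)]
    regret = [[0.0] * n_roads for _ in range(n_players)]
    history = []
    for _ in range(steps):
        load = [actions.count(r) for r in range(n_roads)]
        history.append(load)
        next_actions = []
        for i, current in enumerate(actions):
            received = -cost(current, load[current])  # utility = -cost (assumed)
            for r in range(n_roads):
                vehicles = load[r] if r == current else load[r] + 1
                hypothetical = -cost(r, vehicles)
                regret[i][r] = ((1 - rho) * regret[i][r]
                                + rho * (hypothetical - received))
            choice = current  # inertia: keep the road with probability 1 - alpha
            if rng.random() < alpha:
                positive = [max(v, 0.0) for v in regret[i]]
                if sum(positive) > 0.0:
                    choice = rng.choices(range(n_roads), weights=positive, k=1)[0]
                else:
                    choice = rng.randrange(n_roads)  # uniform when no positive regret
            next_actions.append(choice)
        actions = next_actions
    final_load = [actions.count(r) for r in range(n_roads)]
    final_cost = [cost(r, final_load[r]) for r in range(n_roads)]
    return history, final_load, final_cost


history, final_load, final_cost = simulate_congestion()
print("vehicles per road:", final_load)
print("cost per road:", [round(c, 2) for c in final_cost])
```

Plotting the entries of `history` against the step index gives a picture of the same kind as Figure 4.4, though the numbers depend on the assumed coefficients.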


Figure  4.4:  Regret  Based  Dynamics  with  Inertia:  Evolution  of  Number  of  Vehicles  on  Each 
Route 

We would like to note that the simplistic nature of this example was solely for
illustrative purposes. Regret based dynamics could be employed on any congestion
game with arbitrary network topology and congestion functions. Furthermore, well
known learning algorithms such as fictitious play [MS96a] could not be implemented
even on this very simple congestion game. A driver using fictitious play would need
to track the empirical frequencies of the choices of the 99 other drivers and compute
an expected utility evaluated over a probability space of dimension $5^{99}$.

Figure 4.5: Regret Based Dynamics with Inertia: Evolution of Congestion Cost on Each Route

We  would  also  like  to  note  that  in  a  congestion  game,  it  may  be  unrealistic  to 
assume  that  players  are  aware  of  the  congestion  function  on  each  road.  This  implies 
that  each  driver  is  unaware  of  his  own  utility  function.  However,  even  in  this  setting, 
regret  based  dynamics  can  be  effectively  employed  under  the  condition  that  each  player 
can  evaluate  congestion  levels  on  alternative  routes.  On  the  other  hand,  if  a  player 
is  only  aware  of  the  congestion  experienced,  then  one  would  need  to  examine  the 
applicability  of  payoff  based  algorithms  [MYA07]  which  will  be  discussed  in  detail  in 
the  following  chapter. 
4.5  Concluding  Remarks  and  Future  Work


In  this  chapter  we  analyzed  the  applicability  of  regret  based  algorithms  on  multi-agent 
systems.  We  demonstrated  that  a  point  of  no-regret  may  not  necessarily  be  a  desirable 
operating  condition.  Furthermore,  the  existing  results  on  regret  based  algorithms  do 
not  preclude  these  inferior  operating  points.  Therefore,  we  introduced  a  modification 
of  the  traditional  no-regret  algorithms  that  (i)  exponentially  discounts  the  memory  and 
(ii)  brings  in  a  notion  of  inertia  in  players’  decision  process.  We  showed  how  these 
modifications can lead to an entire class of regret based algorithms that provide convergence
to a pure Nash equilibrium in any weakly acyclic game. We believe that
similar  results  hold  for  no-regret  algorithms  without  fading  memory  and  inertia  but 
thus  far  the  proofs  have  been  elusive. 

4.6  Appendix  to  Chapter  4 

4.6.1  Proof  of  Theorem  4.3.1 

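We will first state and prove a series of claims. The first claim states that if at any time
a player plays an action with positive regret, then the player will play an action with
positive regret at all subsequent time steps.

Claim 4.6.1. Fix any $t_0 > 1$. Then,

$$\tilde{R}_i^{a_i(t_0)}(t_0) > 0 \implies \tilde{R}_i^{a_i(t)}(t) > 0$$

for all $t > t_0$.

Proof. Suppose $\tilde{R}_i^{a_i(t_0)}(t_0) > 0$. We have

$$\tilde{R}_i^{a_i(t_0)}(t_0 + 1) = (1-\rho)\, \tilde{R}_i^{a_i(t_0)}(t_0) > 0.$$

If $a_i(t_0+1) = a_i(t_0)$, then

$$\tilde{R}_i^{a_i(t_0+1)}(t_0+1) = \tilde{R}_i^{a_i(t_0)}(t_0+1) > 0.$$

If $a_i(t_0+1) \neq a_i(t_0)$, then

$$\tilde{R}_i^{a_i(t_0+1)}(t_0+1) > 0,$$

because a player who switches actions does so through $RB_i$, which by (4.3) assigns positive
probability only to actions with positive regret whenever some regret is positive.
The argument can be repeated to show that $\tilde{R}_i^{a_i(t)}(t) > 0$, for all $t > t_0$. □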

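Define

$$M_u := \max\{U_i(a) : a \in \mathcal{A},\ \mathcal{P}_i \in \mathcal{P}\},$$
$$m_u := \min\{U_i(a) : a \in \mathcal{A},\ \mathcal{P}_i \in \mathcal{P}\},$$
$$\delta := \min\{|U_i(a^1) - U_i(a^2)| > 0 : a^1, a^2 \in \mathcal{A},\ a^1_{-i} = a^2_{-i},\ \mathcal{P}_i \in \mathcal{P}\},$$
$$N := \min\{n \in \{1, 2, \ldots\} : (1 - (1-\rho)^n)\,\delta - (1-\rho)^n (M_u - m_u) > \delta/2\},$$
$$f := \min\{RB_i^m(x) : |x^\ell| \leq M_u - m_u,\ \forall \ell,\ x^m \geq \delta/2 \text{ for one } m,\ \forall\, \mathcal{P}_i \in \mathcal{P}\}.$$

Note that $\delta, f > 0$, and $|\tilde{R}_i^{a_i}(t)| \leq M_u - m_u$, for all $\mathcal{P}_i \in \mathcal{P}$, $a_i \in \mathcal{A}_i$, $t > 1$.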
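To give a feel for the size of the constant $N$, the following Python sketch evaluates its definition directly. The function name `absorption_horizon` and the numerical inputs are hypothetical; `utility_range` stands for $M_u - m_u$.

```python
def absorption_horizon(rho, delta, utility_range):
    """Smallest n with (1 - (1-rho)^n) * delta - (1-rho)^n * utility_range > delta / 2."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    if delta <= 0.0 or utility_range < 0.0:
        raise ValueError("delta must be positive and utility_range nonnegative")
    n = 1
    while True:
        decay = (1.0 - rho) ** n
        if (1.0 - decay) * delta - decay * utility_range > delta / 2.0:
            return n
        n += 1  # terminates: decay -> 0, so the left side tends to delta > delta / 2


print(absorption_horizon(rho=0.5, delta=1.0, utility_range=2.0))  # 3
print(absorption_horizon(rho=0.1, delta=1.0, utility_range=2.0))  # 18
```

A smaller $\rho$ (longer memory) gives a larger $N$, since old regret takes more steps to be discounted away.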

The second claim states a condition describing the absorptive properties of a strict
Nash equilibrium.

Claim 4.6.2. Fix $t_0 > 1$. Assume

1. $a(t_0)$ is a strict Nash equilibrium, and

2. $\tilde{R}_i^{a_i(t_0)}(t_0) > 0$ for all $\mathcal{P}_i \in \mathcal{P}$, and

3. $a(t_0) = a(t_0+1) = \cdots = a(t_0+N-1)$.

Then, $a(t) = a(t_0)$, for all $t \geq t_0$.


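Proof. For any $\mathcal{P}_i \in \mathcal{P}$ and any $a_i \in \mathcal{A}_i$, we have

$$\tilde{R}_i^{a_i}(t_0+N) = (1-\rho)^N \tilde{R}_i^{a_i}(t_0) + \left(1 - (1-\rho)^N\right) \left( U_i(a_i, a_{-i}(t_0)) - U_i(a_i(t_0), a_{-i}(t_0)) \right).$$

Since $a(t_0)$ is a strict Nash equilibrium, for any $\mathcal{P}_i \in \mathcal{P}$ and any $a_i \in \mathcal{A}_i$, $a_i \neq a_i(t_0)$,
we have

$$U_i(a_i, a_{-i}(t_0)) - U_i(a_i(t_0), a_{-i}(t_0)) \leq -\delta.$$

Therefore, for any $\mathcal{P}_i \in \mathcal{P}$ and any $a_i \in \mathcal{A}_i$, $a_i \neq a_i(t_0)$,

$$\tilde{R}_i^{a_i}(t_0+N) \leq (1-\rho)^N (M_u - m_u) - \left(1 - (1-\rho)^N\right)\delta < -\delta/2 < 0.$$

We also know that, for all $\mathcal{P}_i \in \mathcal{P}$,

$$\tilde{R}_i^{a_i(t_0)}(t_0+N) = (1-\rho)^N \tilde{R}_i^{a_i(t_0)}(t_0) > 0.$$

This proves the claim. □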
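The first identity in the proof follows from unrolling the discounted regret recursion over the $N$ steps during which the joint action is held fixed. A short worked derivation is sketched below; the shorthand $d_i$ is introduced here for convenience and does not appear in the original text.

```latex
% Assumption: the joint action is held fixed on t_0, ..., t_0+N-1,
% so the one-step increment d_i is the same at every step.
\begin{align*}
d_i &:= U_i(a_i, a_{-i}(t_0)) - U_i(a_i(t_0), a_{-i}(t_0)), \\
\tilde{R}_i^{a_i}(t_0+N)
  &= (1-\rho)^N \tilde{R}_i^{a_i}(t_0) + \rho\, d_i \sum_{k=0}^{N-1} (1-\rho)^k \\
  &= (1-\rho)^N \tilde{R}_i^{a_i}(t_0) + \bigl(1 - (1-\rho)^N\bigr)\, d_i .
\end{align*}
```

The last step uses the geometric sum $\rho \sum_{k=0}^{N-1} (1-\rho)^k = 1 - (1-\rho)^N$. For the played action $a_i = a_i(t_0)$ the increment $d_i$ is zero, which gives the final displayed equality of the proof.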

The third claim states an event, and associated probability, where the ensuing joint
action is a better response to the current joint action profile.

Claim 4.6.3. Fix $t_0 > 1$. Assume

1. $a(t_0)$ is not a Nash equilibrium, and

2. $a(t_0) = a(t_0+1) = \cdots = a(t_0+N-1)$.

Let $a^* = (a_i^*, a_{-i}(t_0))$ be such that

$$U_i(a_i^*, a_{-i}(t_0)) > U_i(a_i(t_0), a_{-i}(t_0)),$$

for some $\mathcal{P}_i \in \mathcal{P}$ and some $a_i^* \in \mathcal{A}_i$. Then, $\tilde{R}_i^{a_i^*}(t_0+N) > \delta/2$, and $a^*$ will be chosen
at step $t_0+N$ with at least probability $\gamma := (1-\bar{\epsilon})^{n-1} \underline{\epsilon} f$.

Proof. We have

$$\tilde{R}_i^{a_i^*}(t_0+N) \geq -(1-\rho)^N (M_u - m_u) + \left(1 - (1-\rho)^N\right)\delta > \delta/2.$$

Therefore, the probability of player $\mathcal{P}_i$ choosing $a_i^*$ at step $t_0+N$ is at least $\underline{\epsilon} f$.
Because of players’ inertia, all other players will repeat their actions at step $t_0+N$ with
probability at least $(1-\bar{\epsilon})^{n-1}$. This means that the action profile $a^*$ will be chosen at
step $t_0+N$ with probability at least $(1-\bar{\epsilon})^{n-1} \underline{\epsilon} f$. □

The fourth claim identifies a particular event, and associated probability,
guaranteeing that each player will only play actions with positive regret as discussed in
Claim 4.6.1.

Claim 4.6.4. Fix $t_0 > 1$. We have $\tilde{R}_i^{a_i(t)}(t) > 0$ for all $t \geq t_0 + 2Nn$ and for all
$\mathcal{P}_i \in \mathcal{P}$ with probability at least

$$\prod_{i=1}^{n} \frac{\underline{\epsilon}}{|\mathcal{A}_i|}\, \gamma\, (1-\bar{\epsilon})^{2nN}.$$

Proof. Let $a^0 := a(t_0)$. Suppose $\tilde{R}_i^{a_i^0}(t_0) \leq 0$. Furthermore, suppose that $a^0$ is
repeated $N$ consecutive times, i.e., $a(t_0) = \cdots = a(t_0+N-1) = a^0$, which occurs with
probability at least $(1-\bar{\epsilon})^{n(N-1)}$.

If there exists an $a^* = (a_i^*, a_{-i}^0)$ such that $U_i(a^*) > U_i(a^0)$, then, by Claim 4.6.3,
$\tilde{R}_i^{a_i^*}(t_0+N) > \delta/2$ and $a^*$ will be chosen at step $t_0+N$ with at least probability $\gamma$.
Conditioned on this, we know from Claim 4.6.1 that $\tilde{R}_i^{a_i(t)}(t) > 0$ for all $t \geq t_0+N$.

If there does not exist such an action $a^*$, then $\tilde{R}_i^{a_i}(t_0+N) \leq 0$ for all $a_i \in \mathcal{A}_i$. An
action profile $(a_i^w, a_{-i}^0)$ with $U_i(a_i^w, a_{-i}^0) < U_i(a^0)$ will be chosen at step $t_0+N$ with at
least probability $\frac{\underline{\epsilon}}{|\mathcal{A}_i|}(1-\bar{\epsilon})^{n-1}$. If $a(t_0+N) = (a_i^w, a_{-i}^0)$, and if furthermore $(a_i^w, a_{-i}^0)$ is
repeated $N$ consecutive times, i.e., $a(t_0+N) = \cdots = a(t_0+2N-1)$, which happens
with probability at least $(1-\bar{\epsilon})^{n(N-1)}$, then, by Claim 4.6.3, $\tilde{R}_i^{a_i^0}(t_0+2N) > \delta/2$
and the action profile $a^0$ will be chosen at step $t_0+2N$ with at least probability $\gamma$.
Conditioned on this, we know from Claim 4.6.1 that $\tilde{R}_i^{a_i(t)}(t) > 0$ for all $t \geq t_0+2N$.

In summary, $\tilde{R}_i^{a_i(t)}(t) > 0$ for all $t \geq t_0+2N$ with at least probability

$$\frac{\underline{\epsilon}}{|\mathcal{A}_i|}\, \gamma\, (1-\bar{\epsilon})^{2nN}.$$

We can repeat this argument for each player to show that $\tilde{R}_i^{a_i(t)}(t) > 0$ for all times
$t \geq t_0 + 2Nn$ and for all $\mathcal{P}_i \in \mathcal{P}$ with probability at least

$$\prod_{i=1}^{n} \frac{\underline{\epsilon}}{|\mathcal{A}_i|}\, \gamma\, (1-\bar{\epsilon})^{2nN}.$$

□


FINAL STEP: Establishing convergence to a strict Nash equilibrium:

Proof. Fix $t_0 > 1$. Define $t_1 := t_0 + 2Nn$. Let $a^1, a^2, \ldots, a^L$ be a finite sequence of
action profiles satisfying the conditions given in Subsection 2.3.4 with $a^1 := a(t_1)$.

Suppose $\tilde{R}_i^{a_i(t)}(t) > 0$ for all $t \geq t_1$ and for all $\mathcal{P}_i \in \mathcal{P}$, which, by Claim 4.6.4,
occurs with probability at least

$$\prod_{i=1}^{n} \frac{\underline{\epsilon}}{|\mathcal{A}_i|}\, \gamma\, (1-\bar{\epsilon})^{2nN}.$$

Suppose further that $a(t_1) = \cdots = a(t_1+N-1) = a^1$, which occurs with at least
probability $(1-\bar{\epsilon})^{n(N-1)}$. According to Claim 4.6.3, the action profile $a^2$ will be played
at step $t_2 := t_1 + N$ with at least probability $\gamma$. Suppose now $a(t_2) = \cdots = a(t_2+N-1) = a^2$,
which occurs with at least probability $(1-\bar{\epsilon})^{n(N-1)}$. According to
Claim 4.6.3, the action profile $a^3$ will be played at step $t_3 := t_2 + N$ with at least
probability $\gamma$.

We can repeat the above arguments until we reach the strict Nash equilibrium $a^L$
at step $t_L$ (recursively defined as above) and stay at $a^L$ for $N$ consecutive steps. From
Claim 4.6.2, this would mean that the action profile would stay at $a^L$ for all $t \geq t_L$.

Therefore, given $t_0 > 1$, there exist constants $\epsilon > 0$ and $T > 0$, both of which are
independent of $t_0$, and a strict Nash equilibrium $a^*$, such that the following event
happens with at least probability $\epsilon$: $a(t) = a^*$ for all $t \geq t_0 + T$. Since $\epsilon$ and $T$ do not
depend on $t_0$, the probability that the action profile has not been absorbed at a strict
Nash equilibrium after $k$ consecutive windows of length $T$ is at most $(1-\epsilon)^k$, which
vanishes as $k \to \infty$. This proves Theorem 4.3.1.

□
CHAPTER  5


Payoff  Based  Dynamics  for  Weakly  Acyclic  Games 

We consider repeated multi-player games in which players repeatedly and simultaneously
choose strategies from a finite set of available strategies according to some
strategy adjustment process. We focus on the specific class of weakly acyclic games,
which is particularly relevant for multi-agent cooperative control problems. A strategy
adjustment process determines how players select their strategies at any stage as
a function of the information gathered over previous stages. Of particular interest
are “payoff based” processes, in which at any stage, players only know their own actions
and (noise corrupted) payoffs from previous stages. In particular, players do not
know the actions taken by other players and do not know the structural form of payoff
functions. We introduce three different payoff based processes for increasingly general
scenarios and prove that after a sufficiently large number of stages, player actions
constitute a Nash equilibrium at any stage with arbitrarily high probability. We also
show how to modify player utility functions through tolls and incentives in so-called
congestion games, a special class of weakly acyclic games, to guarantee that a centralized
objective can be realized as a Nash equilibrium. We illustrate the methods with a
simulation of distributed routing over a network.
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5.1  Introduction 


The  objective  in  distributed  cooperative  control  for  multi-agent  systems  is  to  enable 
a  collection  of  “self-interested”  agents  to  achieve  a  desirable  “collective”  objective. 
There  are  two  overriding  challenges  to  achieving  this  objective.  The  first  is  complexity: 
finding  an  optimal  solution  by  a  centralized  algorithm  may  be  prohibitively  difficult 
when  there  are  large  numbers  of  interacting  agents.  This  motivates  the  use  of  adaptive 
methods  that  enable  agents  to  “self  organize”  into  suitable,  if  not  optimal,  collective 
solutions. 

The second challenge is limited information. Agents may have limited knowledge
about the status of other agents, except perhaps for a small subset of “neighboring”
agents. An example is collective motion control for mobile sensor platforms (e.g.,
[GSM05]). In these problems, mobile sensors seek to position themselves to achieve
various collective objectives such as rendezvous or area coverage. Sensors can communicate
with neighboring sensors, but otherwise do not have global knowledge of the
domain of operation or the status and locations of non-neighboring sensors.

A  typical  assumption  is  that  agents  are  endowed  with  a  reward  or  utility  function 
that  depends  on  their  own  strategies  and  the  strategies  of  other  agents.  In  motion 
coordination  problems,  for  example,  an  agent’s  utility  function  typically  depends  on 
its  position  relative  to  other  agents  or  environmental  targets,  and  knowledge  of  this 
function  guides  local  motion  adjustments. 

In other situations, agents may know nothing about the structure of their utility
functions, and how their own utility depends on the actions of other agents (whether local
or far away). In this case the only thing they can do is observe rewards based on
experience and “optimize” on a trial and error basis. The situation is further complicated
because all agents are trying simultaneously to optimize their own strategies. Therefore,
even in the absence of noise, an agent trying the same strategy twice may see
different results because of the non-stationary nature of the strategies of other agents.

There are several examples of multi-agent systems that illustrate this situation. In
distributed routing for ad hoc data networks (e.g., [BK03]), routing nodes seek to route
packets to neighboring nodes based on packet destinations without knowledge of the
overall network structure. The objective is to minimize the delay of packets to their
destinations. This delay must be realized through trial and error, since the functional
dependence of delay on routing strategies is not known. A similar problem is automotive
traffic routing, in which drivers seek to minimize the congestion experienced to get
to a desired destination. Drivers can experience the congestion on selected routes as a
function of the routes selected by other drivers, but drivers do not know the structure of
the congestion function. Finally, in a multi-agent approach to designing manufacturing
systems (e.g., [Ger94]), it may not be known in advance how performance measures
(such as throughput) depend on manufacturing policy. Rather, performance can only
be measured once a policy is implemented.

Our  interest  in  this  chapter  is  to  develop  algorithms  that  enable  coordination  in 
multi-agent  systems  for  precisely  this  “payoff  based”  scenario,  in  which  agents  only 
have  access  to  (possibly  noisy)  measurements  of  the  rewards  received  through  repeated 
interactions  with  other  agents.  We  adopt  the  framework  of  “learning  in  games”  (see 
[FL98,  Har05,  You98,  You05]  for  an  extensive  overview).  Unlike  most  of  the  learning 
rules  in  this  literature,  which  assume  that  agents  adjust  their  behavior  based  on  the 
observed  behavior  of  other  agents,  we  shall  assume  that  agents  know  only  their  own 
past  actions  and  the  payoffs  that  resulted.  It  is  far  from  obvious  that  Nash  equilibrium 
can  be  achieved  under  such  a  restriction,  but  in  fact  it  has  recently  been  shown  that  such 
“payoff  based”  learning  rules  can  be  constructed  that  work  in  any  game  [FY06,  GL]. 

In this chapter we show that there are simpler and more intuitive adjustment rules
that achieve this objective for a large class of multi-player games known as “weakly
acyclic” games. This class captures many problems of interest in cooperative control
[MAS07a, MAS07b]. It includes the very special case of “identical interest” games,
where each agent receives the same reward. However, weakly acyclic games (and the
related concept of potential games) capture other scenarios such as congestion games
[Ros73] and similar problems such as distributed routing in networks, weapon target
assignment, consensus, and area coverage. See [MAS05, AMS07] and references
therein for a discussion of a learning in games approach to cooperative control problems,
but under less stringent informational constraints than those considered in
this chapter.

For  many  multi-agent  problems,  operation  at  a  pure  Nash  equilibrium  may  reflect 
optimization of a collective objective.¹ We will derive payoff based dynamics that
guarantee  asymptotically  that  agent  strategies  will  constitute  a  pure  Nash  equilibrium 
with  arbitrarily  high  probability.  It  need  not  always  be  the  case  that  at  least  one  Nash 
equilibrium  optimizes  a  collective  objective.  Motivated  by  this  consideration,  we  also 
discuss  the  introduction  of  incentives  or  tolls  in  a  player’s  payoff  function  to  assure 
that  there  is  at  least  one  Nash  equilibrium  that  optimizes  a  collective  objective.  Even 
in  this  case,  however,  there  may  still  be  suboptimal  Nash  equilibria. 

The remainder of this chapter is organized as follows. Section 5.2 introduces three
types of payoff based dynamics for increasingly general problems. Section 5.2.1
presents “Safe Experimentation Dynamics”, which is restricted to identical interest
games. Section 5.2.2 presents “Simple Experimentation Dynamics” for the more general
class of weakly acyclic games but with noise free payoff measurements. Section 5.2.3
presents “Sample Experimentation Dynamics” for weakly acyclic games
with noisy payoff measurements. Section 5.3 discusses how to introduce tolls and
incentives in payoffs so that a Nash equilibrium optimizes a collective objective.
Section 5.4 presents an illustrative example of a traffic congestion game. Finally,
Section 5.5 contains some concluding remarks. An important analytical tool throughout is
the method of resistance trees for perturbed Markov chains [You93], which is reviewed
in the appendix of this chapter.

¹ Nonetheless, there are varied viewpoints on the role of Nash equilibrium as a solution concept for
multi-agent systems. See [SPG07] and [MS07].

5.2  Payoff  Based  Learning  Algorithms 

In this section, we will introduce three simple payoff based learning algorithms. The
first, called Safe Experimentation, guarantees convergence to a pure optimal Nash
equilibrium in any identical interest game. Such an equilibrium is optimal because
each player’s utility is maximized. The second learning algorithm, called Simple
Experimentation, guarantees convergence to a pure Nash equilibrium in any weakly
acyclic game. The third learning algorithm, called Sample Experimentation, guarantees
convergence to a pure Nash equilibrium in any weakly acyclic game even when
utility measurements are corrupted with noise.

For each learning algorithm, we consider a repeated strategic form game, as described
in Section 2.4, with $n$-player set $V := \{V_1, \dots, V_n\}$, a finite action set $A_i$ for
each player $V_i \in V$, and a utility function $U_i : A \to \mathbb{R}$ for each player $V_i \in V$, where
$A := A_1 \times \cdots \times A_n$.

5.2.1  Safe  Experimentation  Dynamics  for  Identical  Interest  Games 

5.2.1.1 Constant Exploration Rates

Before introducing the learning dynamics, we introduce the following function. Let

$$U_i^{\max}(t) := \max_{0 \le \tau \le t-1} U_i(a(\tau))$$

be the maximum utility that player $V_i$ has received up to time $t - 1$.

We  will  now  introduce  the  Safe  Experimentation  dynamics  for  identical  interest 
games;  see  Section  2.3.1  for  a  review  of  identical  interest  games. 

1. Initialization: At time $t = 0$, each player randomly selects and plays any action,
$a_i(0)$. This action will be initially set as the player’s baseline action at time $t = 1$
and is denoted by $a_i^b(1) = a_i(0)$.

2. Action Selection: At each subsequent time step, each player selects his baseline
action with probability $(1 - \epsilon)$ or experiments with a new random action with
probability $\epsilon$, i.e.:

• $a_i(t) = a_i^b(t)$ with probability $(1 - \epsilon)$

• $a_i(t)$ is chosen randomly (uniformly) over $A_i$ with probability $\epsilon$

The variable $\epsilon$ will be referred to as the player’s exploration rate.

3. Baseline Strategy Update: Each player compares the actual utility received,
$U_i(a(t))$, with the maximum received utility $U_i^{\max}(t)$ and updates his baseline
action as follows:

$$a_i^b(t+1) = \begin{cases} a_i(t), & U_i(a(t)) > U_i^{\max}(t); \\ a_i^b(t), & U_i(a(t)) \le U_i^{\max}(t). \end{cases}$$

This step is performed whether or not Step 2 involved exploration.

4.  Return  to  Step  2  and  repeat. 

The reason that this learning algorithm is called “Safe” Experimentation is that
the utility evaluated at the baseline action, $U(a^b(t))$, is non-decreasing with respect to
time.

Theorem 5.2.1. Let $G$ be a finite $n$-player identical interest game in which all players
use the Safe Experimentation dynamics. Given any probability $p < 1$, if the exploration
rate $\epsilon > 0$ is sufficiently small, then for all sufficiently large times $t$, $a(t)$ is an optimal
Nash equilibrium of $G$ with at least probability $p$.

Proof. Since $G$ is an identical interest game, let the utility of each player be expressed
as $U : A \to \mathbb{R}$ and let $A^*$ be the set of “optimal” Nash equilibria of $G$, i.e.,

$$A^* = \{a^* \in A : U(a^*) = \max_{a \in A} U(a)\}.$$

For any joint action $a(t)$, the ensuing joint action will constitute an optimal Nash
equilibrium with at least probability

$$\prod_{i=1}^{n} \frac{\epsilon}{|A_i|},$$

where $|A_i|$ denotes the cardinality of the action set of player $V_i$. Therefore, an optimal
Nash equilibrium will eventually be played with probability 1 for any $\epsilon > 0$.

Suppose an optimal Nash equilibrium is first played at time $t^*$, i.e., $a(t^*) \in A^*$ and
$a(t^* - 1) \notin A^*$. Then the baseline joint action must remain constant from that time
onwards, i.e., $a^b(t) = a(t^*)$ for all $t > t^*$. An optimal Nash equilibrium will then be
played at any time $t > t^*$ with at least probability $(1 - \epsilon)^n$. Since $\epsilon > 0$ can be chosen
arbitrarily small, and in particular such that $(1 - \epsilon)^n > p$, this completes the proof. □
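As an illustration only, the Safe Experimentation steps and the choice of exploration rate used in the proof can be sketched in Python. The function names `safe_experimentation` and `max_exploration_rate`, the payoff table, and the fixed random seed are assumptions of this sketch and do not come from the text. Because every player receives the same payoff in an identical interest game, a single running maximum stands in for each player's maximum received utility.

```python
import random

def max_exploration_rate(p, n):
    """Largest eps with (1 - eps)**n >= p; any smaller positive eps gives a strict inequality."""
    return 1.0 - p ** (1.0 / n)

def safe_experimentation(utility, action_sets, eps, steps, seed=0):
    """Simulate Safe Experimentation in an identical interest game.

    utility: maps a joint action (tuple) to the common payoff.
    action_sets: one finite list of actions per player.
    Returns the final baseline joint action and the list of joint actions played.
    """
    rng = random.Random(seed)
    # Step 1: each player plays a random action, which becomes its baseline.
    joint = tuple(rng.choice(acts) for acts in action_sets)
    baseline = list(joint)
    u_max = utility(joint)  # maximum utility received so far (common to all players)
    history = [joint]
    for _ in range(steps):
        # Step 2: baseline with probability 1 - eps, otherwise a uniformly random action.
        joint = tuple(
            rng.choice(acts) if rng.random() < eps else b
            for acts, b in zip(action_sets, baseline)
        )
        u = utility(joint)
        # Step 3: adopt the played actions only on a strict improvement.
        if u > u_max:
            baseline, u_max = list(joint), u
        history.append(joint)
    return tuple(baseline), history

# Hypothetical two-player coordination game with a unique optimal equilibrium (1, 1).
payoff = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
baseline, history = safe_experimentation(
    lambda a: payoff[a], [[0, 1], [0, 1]], eps=0.3, steps=5000
)
```

The baseline payoff never decreases in this sketch, which mirrors the "safe" property noted above; the exploration rate of 0.3 is chosen only so that the small example reaches the optimum quickly.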

5.2.1.2  Diminishing  Exploration  Rates 

In the Safe Experimentation dynamics, the exploration rate $\epsilon$ was defined as a constant.
Alternatively, one could let the exploration rate vary to induce desirable behavior. One
example would be to let the exploration rate decay, such as $\epsilon_t = (1/t)^{1/n}$. This would
induce exploration at early stages and reduce exploration at later stages of the game.
The theorem and proof hold under the following conditions for the exploration rate:

$$\lim_{t \to \infty} \epsilon_t = 0,$$

5.2.2  Simple  Experimentation  Dynamics  for  Weakly  Acyclic  Games 

We  will  now  introduce  the  Simple  Experimentation  dynamics  for  weakly  acyclic  games; 
see  Section  2.3.4  for  a  review  of  weakly  acyclic  games.  These  dynamics  will  allow  us 
to  relax  the  assumption  of  identical  interest  games. 

1. Initialization: At time $t = 0$, each player randomly selects and plays any action,
$a_i(0)$. This action will be initially set as the player’s baseline action at time 1,
i.e., $a_i^b(1) = a_i(0)$. Likewise, the player’s baseline utility at time 1 is initialized
as $u_i^b(1) = U_i(a(0))$.

2. Action Selection: At each subsequent time step, each player selects his baseline
action with probability $(1 - \epsilon)$ or experiments with a new random action with
probability $\epsilon$.

• $a_i(t) = a_i^b(t)$ with probability $(1 - \epsilon)$

• $a_i(t)$ is chosen randomly (uniformly) over $A_i$ with probability $\epsilon$

The variable $\epsilon$ will be referred to as the player’s exploration rate. Whenever
$a_i(t) \ne a_i^b(t)$, we will say that player $V_i$ experimented.

3. Baseline Action and Baseline Utility Update: Each player compares the utility
received, $U_i(a(t))$, with his baseline utility, $u_i^b(t)$, and updates his baseline action
and utility as follows:

• If player $V_i$ experimented (i.e., $a_i(t) \ne a_i^b(t)$) and if $U_i(a(t)) > u_i^b(t)$ then

$a_i^b(t+1) = a_i(t)$,
$u_i^b(t+1) = U_i(a(t))$.

• If player $V_i$ experimented and if $U_i(a(t)) \le u_i^b(t)$ then

$a_i^b(t+1) = a_i^b(t)$,
$u_i^b(t+1) = u_i^b(t)$.

• If player $V_i$ did not experiment (i.e., $a_i(t) = a_i^b(t)$) then

$a_i^b(t+1) = a_i^b(t)$,
$u_i^b(t+1) = U_i(a(t))$.

4.  Return  to  Step  2  and  repeat. 

As  before,  these  dynamics  require  only  utility  measurements,  and  hence  almost  no 
information  regarding  the  structure  of  the  game. 
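The Simple Experimentation steps listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the representation of each utility as a Python callable on joint actions, and the fixed seed are all hypothetical choices made here. The helper `is_nash` is included only so that a reader can check whether the returned baseline is a pure Nash equilibrium.

```python
import random

def is_nash(utilities, action_sets, joint):
    """True if no player can strictly gain by a unilateral deviation from `joint`."""
    for i, acts in enumerate(action_sets):
        for x in acts:
            deviation = joint[:i] + (x,) + joint[i + 1:]
            if utilities[i](deviation) > utilities[i](joint):
                return False
    return True

def simple_experimentation(utilities, action_sets, eps, steps, seed=0):
    """utilities[i](joint) is player i's payoff; returns final baseline actions and utilities."""
    rng = random.Random(seed)
    n = len(action_sets)
    # Step 1: random initial play sets the baseline action and baseline utility.
    joint = tuple(rng.choice(acts) for acts in action_sets)
    base_a = list(joint)
    base_u = [utilities[i](joint) for i in range(n)]
    for _ in range(steps):
        # Step 2: baseline with probability 1 - eps, otherwise a uniformly random action.
        joint = tuple(
            rng.choice(acts) if rng.random() < eps else b
            for acts, b in zip(action_sets, base_a)
        )
        # Step 3: each player updates using only its own action and payoff.
        for i in range(n):
            u_i = utilities[i](joint)
            if joint[i] != base_a[i]:      # player i experimented
                if u_i > base_u[i]:        # keep the experiment only if strictly better
                    base_a[i], base_u[i] = joint[i], u_i
            else:                          # no experiment: refresh the baseline utility
                base_u[i] = u_i
    return tuple(base_a), base_u
```

For a small exploration rate the returned baseline is a Nash equilibrium with high probability, not with certainty, so any single run should be checked with `is_nash` rather than assumed to have converged.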

Theorem 5.2.2. Let $G$ be a finite $n$-player weakly acyclic game in which all players
use the Simple Experimentation dynamics. Given any probability $p < 1$, if the exploration
rate $\epsilon > 0$ is sufficiently small, then for all sufficiently large times $t$, $a(t)$ is a
Nash equilibrium of $G$ with at least probability $p$.

The remainder of this subsection is devoted to the proof of Theorem 5.2.2. The
proof relies on the theory of resistance trees for perturbed Markov chains (see the appendix
of this chapter for a brief review).

Define the state of the dynamics to be the pair $[a, u]$, where $a$ is the baseline joint
action and $u$ is the baseline utility vector. We will omit the superscript $b$ to avoid
cumbersome notation.

Partition the state space into the following three sets. First, let $X$ be the set of states
$[a, u]$ such that $u_i \ne U_i(a)$ for at least one player $V_i$. Let $E$ be the set of states $[a, u]$
such that $u_i = U_i(a)$ for all players $V_i$ and $a$ is a Nash equilibrium. Let $D$ be the set
of states $[a, u]$ such that $u_i = U_i(a)$ for all players $V_i$ and $a$ is a disequilibrium (not a
Nash equilibrium). These are all the states.

Claim 5.2.1. a. Any state $[a, u] \in X$ transitions to a state in $E \cup D$ in one period
with probability $O(1)$.

b. Any state $[a, u] \in E \cup D$ transitions to a different state $[a', u']$ with probability
at most $O(\epsilon)$.

Proof. For any $[a, u'] \in X$, there exists at least one player $V_i$ such that $u'_i \ne U_i(a)$. If
all players repeat their part of the joint action profile $a$, which occurs with probability
$(1 - \epsilon)^n$, then $[a, u']$ transitions to $[a, u]$, where $u_i = U_i(a)$ for all players $V_i$. Thus
the process moves to $[a, u] \in E \cup D$ with probability $O(1)$. This proves statement (a). As
for statement (b), any state in $E \cup D$ transitions back to itself whenever no player
experiments, which occurs with probability at least $O(1)$. □
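As a brief check of statement (b), offered as a sketch rather than as part of the original argument: a state in $E \cup D$ can change only if at least one player deviates from his baseline action, and every player repeats his baseline action with probability at least $(1 - \epsilon)^n$. Bernoulli's inequality, $(1 - \epsilon)^n \ge 1 - n\epsilon$ for $\epsilon \in [0, 1]$, then gives the stated order:

```latex
\[
\Pr\big[\text{the state changes}\big] \;\le\; 1 - (1-\epsilon)^n \;\le\; n\epsilon \;=\; O(\epsilon).
\]
```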

Claim 5.2.2. For any state $[a, u] \in D$, there is a finite sequence of transitions to a
state $[a^*, u^*] \in E$, where the transitions have the form²:

$$[a, u] \xrightarrow[O(\epsilon)]{} [a^1, u^1] \xrightarrow[O(\epsilon)]{} \cdots \xrightarrow[O(\epsilon)]{} [a^*, u^*],$$

where $u_i^k = U_i(a^k)$ for all $i$ and for all $k > 0$, and each transition occurs with
probability $O(\epsilon)$.

Proof. Such a sequence is guaranteed by weak acyclicity. Since $a$ is not an equilibrium,
there is a better reply path from $a$ to some equilibrium $a^*$, say $a, a^1, a^2, \dots, a^*$.

At $[a, u]$ the appropriate player $V_i$ experiments with probability $\epsilon$, chooses the
appropriate better reply with probability $1/|A_i|$, and no one else experiments. Thus the
process moves to $[a^1, u^1]$, where $u_i^1 = U_i(a^1)$ for all players $V_i$, with probability $O(\epsilon)$.
Notice that for the deviator $V_i$, $U_i(a^1) > U_i(a)$, therefore $u_i^1 = U_i(a^1)$. For a
non-deviator, say player $V_j$, $u_j^1 = U_j(a^1)$ since $a_j^1 = a_j$. Thus $[a^1, u^1] \in D \cup E$. In the next
period, the appropriate player deviates and so forth.

□

² We will use the notation $z \to z'$ to denote the transition from state $z$ to state $z'$. We use
$z \xrightarrow[O(\epsilon)]{} z'$ to emphasize that this transition occurs with probability of order $\epsilon$.
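The better reply path used in the proof of Claim 5.2.2 can be made concrete with a small search routine. The sketch below assumes the standard notion of a better reply path, in which each step changes one player's action and strictly raises that player's utility; the function names and the breadth-first search are illustrative choices made here, not the author's construction.

```python
from collections import deque
from itertools import product

def better_reply_path(utilities, action_sets, start):
    """Breadth-first search for a better reply path from `start` to a pure Nash
    equilibrium. Returns the path as a list of joint actions, or None if none exists."""
    def better_replies(a):
        # All joint actions reachable by one player strictly improving unilaterally.
        for i, acts in enumerate(action_sets):
            for x in acts:
                b = a[:i] + (x,) + a[i + 1:]
                if utilities[i](b) > utilities[i](a):
                    yield b

    parent = {start: None}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        successors = list(better_replies(a))
        if not successors:  # no player can improve, so `a` is a pure Nash equilibrium
            path = []
            while a is not None:
                path.append(a)
                a = parent[a]
            return path[::-1]
        for b in successors:
            if b not in parent:
                parent[b] = a
                queue.append(b)
    return None

def is_weakly_acyclic(utilities, action_sets):
    """True if every joint action has a better reply path to a pure Nash equilibrium."""
    return all(
        better_reply_path(utilities, action_sets, a) is not None
        for a in product(*action_sets)
    )
```

Each edge of a path returned by `better_reply_path` corresponds to one transition of order $\epsilon$ in the claim: a single player experiments, picks the improving action, and nobody else experiments.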

Claim 5.2.3. For any equilibrium $[a^*, u^*] \in E$, any path from $[a^*, u^*]$ to another state
$[a, u] \in E \cup D$, $a \ne a^*$, that does not loop back to $[a^*, u^*]$ must be of one of the
following two forms:

1. $[a^*, u^*] \xrightarrow[O(\epsilon)]{} [a^*, u'] \xrightarrow[O(\epsilon^k)]{} [a', u''] \to \cdots \to [a, u]$, where $k \ge 1$;

2. $[a^*, u^*] \xrightarrow[O(\epsilon^k)]{} [a', u''] \to \cdots \to [a, u]$, where $k \ge 2$.

Proof. The path must begin by either one player experimenting or more than one player
experimenting. Case (2) results if more than one player experiments. Case (1) results
if exactly one agent, say agent $V_i$, experiments with an action $a'_i \ne a_i^*$ and all other
players continue to play their part of $a^*$. This happens with probability $\frac{\epsilon}{|A_i|}(1 - \epsilon)^{n-1}$.
In this situation, player $V_i$ cannot be better off, meaning that $U_i(a'_i, a^*_{-i}) \le U_i(a^*)$,
since by assumption $a^*$ is an equilibrium. Hence the baseline action next period remains
$a^*$ for all players, though their baseline utilities may change. Denote the next
state by $[a^*, u']$. If in the subsequent period all players continue to play their part of
the action $a^*$ again, which occurs with probability $(1 - \epsilon)^n$, then the state reverts back
to $[a^*, u^*]$ and we have a loop. Hence the only way the path can continue without a
loop is for one or more players to experiment in the next stage, which has probability
$O(\epsilon^k)$, $k \ge 1$. This is exactly what case (1) alleges.

□ 

Proof of Theorem 5.2.2. This is a finite aperiodic Markov process on the state space
$A \times U$, where $U$ denotes the finite set of baseline utility vectors. Furthermore, from
every state there exists a positive probability path to a Nash equilibrium. Hence, every
recurrent class has at least one Nash equilibrium. We will now show that within any
recurrent class, the trees (see the appendix of this chapter) rooted at the Nash equilibrium
will have the lowest resistance. Therefore, according to Theorem 5.6.1, the
a priori probability that the state will be a Nash equilibrium can be made arbitrarily
close to 1.

In order to apply Theorem 5.6.1, we will construct minimum resistance trees with
vertices consisting of every possible state (within a recurrence class). Each edge will
have resistance $0, 1, 2, \dots$ associated with the transition probabilities
$O(1), O(\epsilon), O(\epsilon^2), \dots$, respectively.

Our analysis will deviate slightly from the presentation in the appendix. In the discussion
in the appendix, the vertices of minimum resistance trees are recurrence classes
of an associated unperturbed Markov chain. In this case, the unperturbed Markov chain
corresponds to Simple Experimentation dynamics with $\epsilon = 0$, and so the recurrence
classes are all states in $E \cup D$. Nonetheless, we will construct resistance trees with the
vertices being all possible states, i.e., $E \cup D \cup X$. The resulting conclusions remain the
same. Since the states in $X$ are transient with probability $O(1)$, the resistance to leave
a node corresponding to a state in $X$ is zero. Therefore, the presence of such states
does not affect the conclusions determining which states are stochastically stable.

Suppose a minimum resistance tree $T$ is rooted at a vertex $v$ that is not in $E$.
If $v \in X$, it is easy to construct a new tree that has lower resistance. Namely, by
Claim 5.2.1a, there is a 0-resistance one-hop path $P$ from $v$ to some state $[a, u] \in E \cup D$.
Add the edge of $P$ to $T$ and subtract the edge in $T$ that exits from the vertex
$[a, u]$. This results in an $[a, u]$-tree $T'$. It has lower resistance than $T$ because the added
edge has zero resistance while the subtracted edge has resistance greater than or equal
to 1 because of Claim 5.2.1b. This argument is illustrated in Figure 5.1, where the red
edge of strictly positive resistance is removed and replaced with the blue edge of zero
resistance.


Figure 5.1: Construction of alternative to tree rooted in X. Panels: Original Tree T (Rooted in X); Revised Tree T' (Rooted in D or E).


Suppose next that $v = [a, u] \in D$ (and hence not in $E$). Construct a path $P$ as in Claim 5.2.2
from $[a, u]$ to some state $[a^*, u^*] \in E$. As above, construct a new tree $T'$ rooted at
$[a^*, u^*]$ by adding the edges of $P$ to $T$ and taking out the redundant edges (the edges
in $T$ that exit from the vertices in $P$). The nature of the path $P$ guarantees that the
edges taken out have total resistance at least as high as the resistances of the edges put
in. This is because the entire path $P$ lies in $E \cup D$, each transition on the path has
resistance 1, and, from Claim 5.2.1b, the resistance to leave any state in $E \cup D$ is at
least 1.

To construct a new tree that has strictly lower resistance, we will inspect the effect
of removing the exiting edge from $[a^*, u^*]$ in $T$. Note that this edge must fit either case
(1) or case (2) of Claim 5.2.3.

In case (2), the resistance of the exiting edge is at least 2, which is larger than
any edge in $P$. Hence the new tree has strictly lower resistance than $T$, which is a
contradiction. This argument is illustrated in Figure 5.2. A new path is created from
the original root $[a, u] \in D$ to the equilibrium $[a^*, u^*] \in E$ (blue edges). Redundant
(red) edges emanating from the new path are removed. In case (2), the redundant edge
emanating from $[a^*, u^*]$ has a resistance of at least 2.


Figure 5.2: Construction of alternative to tree rooted in D for Case (2). Panels: Original Tree T (Rooted in D - Case 2); Revised Tree T' (Rooted in E).


In case (1), the exiting edge has the form $[a^*, u^*] \to [a^*, u']$, which has resistance 1,
where $u^* \ne u'$. The next edge in $T$, say $[a^*, u'] \to [a', u'']$, also has at least resistance
1. Remove the edge $[a^*, u'] \to [a', u'']$ from $T$, and put in the edge $[a^*, u'] \to [a^*, u^*]$.
The latter has resistance 0 since $[a^*, u'] \in X$. This results in a tree $T''$ that is rooted
at $[a^*, u^*]$ and has strictly lower resistance than does $T$, which is a contradiction. This
argument is illustrated in Figure 5.3. As in Figure 5.2, a new (blue) path is constructed
and redundant (red) edges are removed. The difference is that the edge $[a^*, u'] \to [a', u'']$
is removed and replaced with $[a^*, u'] \to [a^*, u^*]$.

To recap, a minimum resistance tree cannot be rooted at any state in $X$ or $D$, and
therefore can only be rooted in $E$. Therefore, when $\epsilon$ is sufficiently small, the long-run
probability on $E$ can be made arbitrarily close to 1, and in particular larger than any
specified probability $p$. □

Figure 5.3: Construction of alternative to tree rooted in D for Case (1). Panels: Original Tree T (Rooted in D - Case 1); Revised Tree T' (Rooted in E).


5.2.3 Sample Experimentation Dynamics for Weakly Acyclic Games with Noisy Utility Measurements

5.2.3.1  Noise-free  Utility  Measurements 

In this section we will focus on developing payoff based dynamics whose limiting
behavior is that of a pure Nash equilibrium with arbitrarily high probability
in any finite weakly acyclic game, even in the presence of utility noise. We will show
that a variant of the so-called Regret Testing algorithm [FY06] accomplishes this
objective for weakly acyclic games with noisy utility measurements.

We  now  introduce  Sample  Experimentation  dynamics. 


1. Initialization: At time $t = 0$, each player randomly selects and plays any action,
$a_i(0) \in A_i$. This action will be initially set as the player’s baseline action,
$a_i^b(1) = a_i(0)$.

2. Exploration Phase: After the baseline action is set, each player engages in an
exploration phase over the next $m$ periods. The length of the exploration phase
need not be the same or synchronized for each player, but we will assume that
they are for the proof. For convenience, we will double index the time of the
actions played as

$$a(t_1, t_2) = a(m\,t_1 + t_2),$$

where $t_1$ indexes the number of the exploration phase and $t_2$ indexes the actions
played in that exploration phase. We will refer to $t_1$ as the exploration phase
time and $t_2$ as the exploration action time. By construction, the exploration
phase time and exploration action time satisfy $t_1 \ge 1$ and $m \ge t_2 \ge 1$. The
baseline action will only be updated at the end of the exploration phase and will
therefore only be indexed by the exploration phase time.

During the exploration phase, each player selects his baseline action with probability
$(1 - \epsilon)$ or experiments with a new random action with probability $\epsilon$. That
is, for any exploration phase time $t_1 \ge 1$ and for any exploration action time
satisfying $m \ge t_2 \ge 1$,

• $a_i(t_1, t_2) = a_i^b(t_1)$ with probability $(1 - \epsilon)$,

• $a_i(t_1, t_2)$ is chosen randomly (uniformly) over $A_i \setminus a_i^b(t_1)$ with probability $\epsilon$.

Again, the variable $\epsilon$ will be referred to as the player’s exploration rate.

3. Action Assessment: After the exploration phase, each player evaluates the average
utility received when playing each of his actions during the exploration
phase. Let $n_i^{a_i}(t_1)$ be the number of times that player $V_i$ played action $a_i$ during
the exploration phase at time $t_1$. The average utility for action $a_i$ during the
exploration phase at time $t_1$ is

$$\hat{V}_i^{a_i}(t_1) := \begin{cases} \dfrac{1}{n_i^{a_i}(t_1)} \sum_{t_2=1}^{m} I\{a_i = a_i(t_1, t_2)\}\, U_i(a(t_1, t_2)), & n_i^{a_i}(t_1) > 0; \\ U_{\min}, & n_i^{a_i}(t_1) = 0, \end{cases}$$

where $I\{\cdot\}$ is the usual indicator function and $U_{\min}$ satisfies

$$U_{\min} < \min_i \min_{a \in A} U_i(a).$$

In words, $U_{\min}$ is less than the smallest payoff any agent can receive.

4. Evaluation of Better Response Set: Each player compares the average utility
received when playing his baseline action, $\hat{V}_i^{a_i^b(t_1)}(t_1)$, with the average utility
received for each of his other actions, $\hat{V}_i^{a_i}(t_1)$, and finds all played actions which
performed $\delta$ better than the baseline action. The term $\delta$ will be referred to as the
players’ tolerance level. Define $A_i^*(t_1)$ to be the set of actions that outperformed
the baseline action as follows:

$$A_i^*(t_1) := \left\{ a_i \in A_i : \hat{V}_i^{a_i}(t_1) \ge \hat{V}_i^{a_i^b(t_1)}(t_1) + \delta \right\}. \qquad (5.1)$$

5. Baseline Strategy Update: Each player updates his baseline action as follows:

• If $A_i^*(t_1) = \emptyset$, then $a_i^b(t_1 + 1) = a_i^b(t_1)$.

• If $A_i^*(t_1) \ne \emptyset$, then

- With probability $\omega$, set $a_i^b(t_1 + 1) = a_i^b(t_1)$. (We will refer to $\omega$ as the
player’s inertia.)

- With probability $1 - \omega$, randomly select $a_i^b(t_1 + 1) \in A_i^*(t_1)$ with
uniform probability.

6.  Return  to  Step  2  and  repeat. 
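Steps 3 to 5 of the Sample Experimentation dynamics, as seen by a single player at the end of one exploration phase, can be sketched as follows. This is a minimal sketch: the function name `assess_and_update`, its argument layout, and the use of Python's `random` module are assumptions made for illustration. The lists `played` and `payoffs` stand for the actions the player chose and the (possibly noisy) payoffs he observed during the phase.

```python
import random

def assess_and_update(baseline, actions, played, payoffs, delta, omega, u_min, rng=random):
    """Return the player's next baseline action and the per-action average payoffs.

    played[k] and payoffs[k] are the action chosen and payoff received at step k
    of the exploration phase; u_min must lie below every achievable payoff.
    """
    # Step 3: average payoff of each action; actions never played are assigned u_min.
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for a, u in zip(played, payoffs):
        totals[a] += u
        counts[a] += 1
    avg = {a: (totals[a] / counts[a] if counts[a] else u_min) for a in actions}

    # Step 4: actions whose average beats the baseline average by at least delta.
    better = [a for a in actions if avg[a] >= avg[baseline] + delta]

    # Step 5: keep the baseline if nothing is better, or with probability omega (inertia);
    # otherwise move to a uniformly chosen member of the better response set.
    if not better or rng.random() < omega:
        return baseline, avg
    return rng.choice(better), avg
```

Because the baseline is played far more often than any other action when the exploration rate is small, its average is the best estimated quantity, while the tolerance `delta` guards against switching on small sampling fluctuations.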
For simplicity, we will first state and prove the desired convergence properties
using noiseless utility measurements. The setup for the noisy utility measurements
will be stated afterwards.

Before stating the following theorem, we define the constant $\alpha > 0$ as follows.
If $U_i(a^1) \ne U_i(a^2)$ for any joint actions $a^1, a^2 \in A$ and any player $V_i \in V$, then
$|U_i(a^1) - U_i(a^2)| > \alpha$. In words, if any two joint actions result in different utilities at
all, then the difference would be at least $\alpha$.

Theorem 5.2.3. Let $G$ be a finite $n$-player weakly acyclic game in which all players use the Sample Experimentation dynamics. For any

• probability $p < 1$,

• tolerance level $\delta \in (0, \alpha)$,

• inertia $\omega \in (0, 1)$, and

• exploration rate $\epsilon$ satisfying $\min\{(\alpha - \delta)/4,\ \delta/4,\ 1 - p\} > (1 - (1 - \epsilon)^n) > 0$,

if the exploration phase length $m$ is sufficiently large, then for all sufficiently large times $t > 0$, $a(t)$ is a Nash equilibrium of $G$ with at least probability $p$.
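The condition on the exploration rate can be solved explicitly: writing $B = \min\{(\alpha - \delta)/4, \delta/4, 1 - p\}$, it is equivalent to $0 < \epsilon < 1 - (1 - B)^{1/n}$. The following sketch evaluates this bound numerically; the function names and the parameter values in the usage note are our own illustrative assumptions, not values taken from the text.

```python
def exploration_bound(alpha, delta, p):
    """B = min{(alpha - delta)/4, delta/4, 1 - p} from the theorem's condition."""
    if not (0 < delta < alpha):
        raise ValueError("tolerance level must satisfy 0 < delta < alpha")
    return min((alpha - delta) / 4, delta / 4, 1 - p)

def max_exploration_rate(alpha, delta, p, n):
    """Supremum of admissible exploration rates: 1 - (1 - B)^(1/n)."""
    return 1 - (1 - exploration_bound(alpha, delta, p)) ** (1 / n)

def satisfies_condition(eps, alpha, delta, p, n):
    """Check B > 1 - (1 - eps)^n > 0 directly."""
    x = 1 - (1 - eps) ** n
    return 0 < x < exploration_bound(alpha, delta, p)
```

For instance, with $\alpha = 0.2$, $\delta = 0.1$, $p = 0.99$ and $n = 10$ players, the bound is $B = 0.01$ and the admissible exploration rates are those below roughly $10^{-3}$, which shows how quickly the allowed rate shrinks as the target probability approaches 1.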

The  remainder  of  this  subsection  is  devoted  to  the  proof  of  Theorem  5.2.3. 

We will assume for simplicity that utilities are between -1/2 and 1/2, i.e., $|U_i(a)| \le 1/2$ for any player $\mathcal{P}_i \in \mathcal{P}$ and any joint action $a \in \mathcal{A}$.

We begin with a series of useful claims. The first claim states that for any player $\mathcal{P}_i$ the average utility for an action $a_i \in \mathcal{A}_i$ during the exploration phase can be made arbitrarily close (with high probability) to the actual utility the player would have received provided that all other players never experimented. This can be accomplished if the experimentation rate is sufficiently small and the exploration phase length is sufficiently large.

Claim 5.2.4. Let $a^b$ be the joint baseline action at the start of an exploration phase of length $m$. For

• any probability $p < 1$,

• any $\delta^* > 0$, and

• any exploration rate $\epsilon > 0$ satisfying $\delta^*/2 > (1 - (1 - \epsilon)^{n-1}) > 0$,

if the exploration phase length $m$ is sufficiently large then

$$\Pr\left[\left|V_i^{a_i} - U_i(a_i, a_{-i}^b)\right| > \delta^*\right] < 1 - p.$$


Proof. Let $n_i(a_i)$ represent the number of times player $\mathcal{P}_i$ played action $a_i$ during the exploration phase. In the following discussion, all probabilities and expectations are conditioned on $n_i(a_i) > 0$. We omit making this explicit for the sake of notational simplicity. The event $n_i(a_i) = 0$ has diminishing probability as the exploration phase length $m$ increases, and so this case will not affect the desired conclusions for increasing phase lengths.

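For an arbitrary $\delta^* > 0$,

$$\begin{aligned}
\Pr\left[\left|V_i^{a_i} - U_i(a_i, a_{-i}^b)\right| > \delta^*\right]
&\le \Pr\left[\left|V_i^{a_i} - E\{V_i^{a_i}\}\right| + \left|E\{V_i^{a_i}\} - U_i(a_i, a_{-i}^b)\right| > \delta^*\right] \\
&\le \underbrace{\Pr\left[\left|V_i^{a_i} - E\{V_i^{a_i}\}\right| > \delta^*/2\right]}_{(*)} + \underbrace{\Pr\left[\left|E\{V_i^{a_i}\} - U_i(a_i, a_{-i}^b)\right| > \delta^*/2\right]}_{(**)}.
\end{aligned}$$

First, let us focus on (**). We have

$$\left|E\{V_i^{a_i}\} - U_i(a_i, a_{-i}^b)\right| = \left[1 - (1 - \epsilon)^{n-1}\right] \left| E\{U_i(a_i, a_{-i}(t)) \mid a_{-i}(t) \neq a_{-i}^b\} - U_i(a_i, a_{-i}^b) \right|,$$

which approaches 0 as $\epsilon \downarrow 0$. Since utilities lie between -1/2 and 1/2, the absolute difference on the right-hand side is at most 1, so the left-hand side is at most $1 - (1 - \epsilon)^{n-1}$. Therefore, for any exploration rate $\epsilon$ satisfying $\delta^*/2 > (1 - (1 - \epsilon)^{n-1}) > 0$, we know that

$$\Pr\left[\left|E\{V_i^{a_i}\} - U_i(a_i, a_{-i}^b)\right| > \delta^*/2\right] = 0.$$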


Now we will focus on (*). By the weak law of large numbers, (*) approaches 0 as $n_i(a_i) \uparrow \infty$. This implies that for any probability $p < 1$ and any exploration rate $\epsilon > 0$, there exists a sample size $n^*(a_i)$ such that if $n_i(a_i) > n^*(a_i)$ then

$$\Pr\left[\left|V_i^{a_i} - E\{V_i^{a_i}\}\right| > \delta^*/2\right] < 1 - p.$$

Lastly, for any probability $p < 1$ and any fixed exploration rate, there exists a minimum exploration length $\bar{m} > 0$ such that for any exploration length $m > \bar{m}$,

$$\Pr\left[n_i(a_i) \ge n^*(a_i)\right] \ge p.$$

In summary, for any fixed exploration rate $\epsilon$ satisfying $\delta^*/2 > (1 - (1 - \epsilon)^{n-1}) > 0$, (*) + (**) can be made arbitrarily close to 0, provided that the exploration length $m$ is sufficiently large. □
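The content of Claim 5.2.4 can be checked numerically on a toy example. The sketch below is a simplified Monte Carlo illustration under our own assumptions: a hypothetical three-player game with binary actions and a parity utility taking values in $\{-1/2, 1/2\}$, a player who plays the evaluated action in every round of the phase, and other players who experiment independently with probability $\epsilon$ by drawing uniformly from their action sets.

```python
import random

def utility(a):
    """Hypothetical utility in [-1/2, 1/2]: +1/2 if an odd number of players pick 1."""
    return (sum(a) % 2) - 0.5

def average_utility(a_i, baseline_others, eps, m, rng):
    """Average utility of action a_i over m rounds while the others experiment."""
    total = 0.0
    for _ in range(m):
        # Each other player keeps his baseline action unless he experiments.
        others = tuple(
            rng.choice((0, 1)) if rng.random() < eps else b
            for b in baseline_others
        )
        total += utility((a_i,) + others)
    return total / m

def bias_bound(eps, n):
    """Upper bound 1 - (1 - eps)^(n - 1) on the bias of the expected average."""
    return 1 - (1 - eps) ** (n - 1)
```

With $\epsilon = 0.01$ and $m = 20000$, the average for action 1 against the baseline $(0, 0)$ stays within the bias bound $1 - (1 - \epsilon)^2 \approx 0.02$ of the true utility $1/2$, up to a small sampling error.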

Claim 5.2.5. Let $a^b$ be the joint baseline action at the start of an exploration phase of length $m$. For any

• probability $p < 1$,

• tolerance level $\delta \in (0, \alpha)$, and

• exploration rate $\epsilon > 0$ satisfying $\min\{(\alpha - \delta)/4,\ \delta/4\} > (1 - (1 - \epsilon)^{n-1}) > 0$,

if the exploration length $m$ is sufficiently large, then each player's better response set $A_i^*$ will contain only and all actions that are a better response to the joint baseline action, i.e.,

$$a_i^* \in A_i^* \iff U_i(a_i^*, a_{-i}^b) > U_i(a^b)$$

with at least probability $p$.
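Proof. Suppose $a^b$ is not a Nash equilibrium. For some player $\mathcal{P}_i \in \mathcal{P}$, let $a_i^*$ be a strict better reply to the baseline joint action, i.e. $U_i(a_i^*, a_{-i}^b) > U_i(a^b)$, and let $a_i^w$ be a non-better reply to the baseline joint action, i.e. $U_i(a_i^w, a_{-i}^b) \le U_i(a^b)$.

Using Claim 5.2.4, for any probability $p < 1$ and any exploration rate $\epsilon > 0$ satisfying $\min\{(\alpha - \delta)/4,\ \delta/4\} > (1 - (1 - \epsilon)^{n-1}) > 0$ there exists a minimum exploration length $\bar{m} > 0$ such that for any exploration length $m > \bar{m}$ the following expressions are true:

$$\Pr\left[\left|V_i^{a_i^b} - U_i(a_i^b, a_{-i}^b)\right| < \delta^*\right] \ge p, \tag{5.2}$$

$$\Pr\left[\left|V_i^{a_i^*} - U_i(a_i^*, a_{-i}^b)\right| < \delta^*\right] \ge p, \tag{5.3}$$

$$\Pr\left[\left|V_i^{a_i^w} - U_i(a_i^w, a_{-i}^b)\right| < \delta^*\right] \ge p, \tag{5.4}$$

where $\delta^* = \min\{(\alpha - \delta)/2,\ \delta/2\}$. Rewriting equation 5.2 we obtain

$$\Pr\left[\left|V_i^{a_i^b} - U_i(a_i^b, a_{-i}^b)\right| < \delta^*\right] \le \Pr\left[V_i^{a_i^b} - U_i(a_i^b, a_{-i}^b) < (\alpha - \delta)/2\right],$$

and rewriting equation 5.3 we obtain

$$\begin{aligned}
\Pr\left[\left|V_i^{a_i^*} - U_i(a_i^*, a_{-i}^b)\right| < \delta^*\right]
&\le \Pr\left[V_i^{a_i^*} - U_i(a_i^*, a_{-i}^b) > -(\alpha - \delta)/2\right], \\
&\le \Pr\left[V_i^{a_i^*} - \left(U_i(a_i^b, a_{-i}^b) + \alpha\right) > -(\alpha - \delta)/2\right], \\
&= \Pr\left[V_i^{a_i^*} - U_i(a_i^b, a_{-i}^b) > (\alpha + \delta)/2\right],
\end{aligned}$$

meaning that

$$\Pr\left[a_i^* \in A_i^*\right] \ge p^2.$$

Similarly, rewriting equation 5.2 we obtain

$$\Pr\left[\left|V_i^{a_i^b} - U_i(a_i^b, a_{-i}^b)\right| < \delta^*\right] \le \Pr\left[V_i^{a_i^b} - U_i(a_i^b, a_{-i}^b) > -\delta/2\right],$$

and rewriting equation 5.4 we obtain

$$\begin{aligned}
\Pr\left[\left|V_i^{a_i^w} - U_i(a_i^w, a_{-i}^b)\right| < \delta^*\right]
&\le \Pr\left[V_i^{a_i^w} - U_i(a_i^w, a_{-i}^b) < \delta/2\right], \\
&\le \Pr\left[V_i^{a_i^w} - U_i(a_i^b, a_{-i}^b) < \delta/2\right],
\end{aligned}$$

meaning that

$$\Pr\left[a_i^w \notin A_i^*\right] \ge p^2.$$

Since $p$ can be chosen arbitrarily close to 1, the proof is complete. □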


Proof of Theorem 5.2.3. The evolution of the baseline actions from phase to phase is a finite aperiodic Markov process on the state space of joint actions, $\mathcal{A}$. Furthermore, since $G$ is weakly acyclic, from every state there exists a better reply path to a Nash equilibrium. Hence, every recurrent class has at least one Nash equilibrium. We will show that these dynamics can be viewed as a perturbation of a certain Markov chain whose recurrent classes are restricted to Nash equilibria. We will then appeal to Theorem 5.6.1 to derive the desired result.

We begin by defining an "unperturbed" process on baseline actions. For any $a^b \in \mathcal{A}$, define the true better reply set as

$$A_i^*(a^b) := \left\{ a_i : U_i(a_i, a_{-i}^b) > U_i(a^b) \right\}.$$

Now define the transition process from $a^b(t_1)$ to $a^b(t_1 + 1)$ as follows:

• If $A_i^*(a^b(t_1)) = \emptyset$, then $a_i^b(t_1 + 1) = a_i^b(t_1)$.

• If $A_i^*(a^b(t_1)) \neq \emptyset$, then

- With probability $\omega$, set $a_i^b(t_1 + 1) = a_i^b(t_1)$.

- With probability $1 - \omega$, randomly select $a_i^b(t_1 + 1) \in A_i^*(a^b(t_1))$ with uniform probability.

This is a special case of a so-called "better reply process with finite memory and inertia". From [You05, Theorem 6.2], the joint actions of this process converge to a Nash equilibrium with probability 1 in any weakly acyclic game. Therefore, the recurrence classes of this unperturbed process are precisely the set of pure Nash equilibria.
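To make the unperturbed process concrete, the following sketch simulates it on a small game. The two-player coordination game in the usage note, the helper names, and the choice of applying the update to all players simultaneously are our own illustrative assumptions; the sketch is not meant as a substitute for the convergence result cited above.

```python
import random

def better_replies(i, a, actions, utility):
    """True better reply set of player i at joint action a."""
    base = utility(i, a)
    return [ai for ai in actions[i]
            if utility(i, a[:i] + (ai,) + a[i + 1:]) > base]

def step(a, actions, utility, inertia, rng):
    """One transition: every player applies the better reply rule with inertia."""
    nxt = []
    for i in range(len(a)):
        replies = better_replies(i, a, actions, utility)
        if not replies or rng.random() < inertia:
            nxt.append(a[i])                 # no better reply, or inertia
        else:
            nxt.append(rng.choice(replies))  # uniform over better replies
    return tuple(nxt)

def is_nash(a, actions, utility):
    """A joint action is a pure Nash equilibrium if no player has a better reply."""
    return all(not better_replies(i, a, actions, utility)
               for i in range(len(a)))

def simulate(a0, actions, utility, inertia, steps, seed=0):
    """Run the process for a fixed number of steps from joint action a0."""
    rng = random.Random(seed)
    a = a0
    for _ in range(steps):
        a = step(a, actions, utility, inertia, rng)
    return a
```

As a usage example, take two players with actions $\{0, 1\}$ and a common payoff of 1 at $(0, 0)$, 2 at $(1, 1)$, and 0 otherwise. Starting from the miscoordinated action $(0, 1)$, both players have a better reply, and inertia is what breaks the symmetry: once exactly one player moves, the process is absorbed at a pure Nash equilibrium.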
The above unperturbed process closely resembles the Baseline Strategy Update process described in Step 5 of Sample Experimentation Dynamics. The difference is that the above process uses the true better reply set, whereas Step 5 uses a better reply set constructed from experimentation over a phase. However, by Claim 5.2.5, for any probability $p < 1$, acceptable tolerance level $\delta$, and acceptable exploration rate $\epsilon$, there exists a minimum exploration phase length $\bar{m}$ such that for any exploration phase length $m > \bar{m}$, each player's better response set will contain only and all actions that are a strict better response with at least probability $p$.

With parameters selected according to Claim 5.2.5, the transitions of the baseline joint actions in Sample Experimentation Dynamics follow that of the above unperturbed better reply process with probability $p$ arbitrarily close to 1. Since the recurrence classes of the unperturbed process are only Nash equilibria, we can conclude from Theorem 5.6.1 that as $p$ approaches 1, the probability that the baseline action for sufficiently large $t_1$ will be a (pure) Nash equilibrium can be made arbitrarily close to 1. By selecting the exploration probability $\epsilon$ sufficiently small, we can also conclude that the joint action during exploration phases, i.e., $a(m t_1 + t_2)$, will also be a Nash equilibrium with probability arbitrarily close to 1.

□


5.2.3.2  Noisy  Utility  Measurements 

Suppose that each player receives a noisy measurement of his true utility, i.e.,

$$\tilde{U}_i(a_i, a_{-i}) = U_i(a_i, a_{-i}) + \nu_i,$$

where $\nu_i$ is an i.i.d. random variable with zero mean. In the Sample Experimentation dynamics with noisy utility measurements, the average utility for action $a_i$ during the exploration phase at time $t_1$ is now

$$V_i^{a_i}(t_1) = \begin{cases} \dfrac{1}{n_i^{a_i}(t_1)} \displaystyle\sum_{t_2 = 1}^{m} I\{a_i = a_i(t_1, t_2)\}\, \tilde{U}_i(a(t_1, t_2)), & n_i^{a_i}(t_1) > 0; \\[2ex] & n_i^{a_i}(t_1) = 0. \end{cases}$$

A  straightforward  modification  of  the  proof  of  Theorem  5.2.3  leads  to  the  following 
theorem. 

Theorem 5.2.4. Let $G$ be a finite $n$-player weakly acyclic game where players' utilities are corrupted with a zero mean noise process. If all players use the Sample Experimentation dynamics, then for any

• probability $p < 1$,

• tolerance level $\delta \in (0, \alpha)$,

• inertia $\omega \in (0, 1)$, and

• exploration rate $\epsilon$ satisfying $\min\{(\alpha - \delta)/4,\ \delta/4,\ 1 - p\} > (1 - (1 - \epsilon)^n) > 0$,

if the exploration phase length $m$ is sufficiently large, then for all sufficiently large times $t > 0$, $a(t)$ is a Nash equilibrium of $G$ with at least probability $p$.
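The reason a long exploration phase tolerates noise is that the per-action averages wash out zero mean disturbances. The following sketch illustrates this with hypothetical helpers of our own (`noisy_phase`, `phase_averages`); it assumes Gaussian noise and simply omits actions that were never played, which is a simplification of the definition above.

```python
import random
from collections import defaultdict

def noisy_phase(true_utility, plays, noise_std, rng):
    """Generate (action, noisy utility) samples for a sequence of played actions."""
    return [(a, true_utility[a] + rng.gauss(0.0, noise_std)) for a in plays]

def phase_averages(samples):
    """Average the noisy utility measurements separately for each played action."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for action, measurement in samples:
        totals[action] += measurement
        counts[action] += 1
    # Actions with zero plays are left out (assumption of this sketch).
    return {a: totals[a] / counts[a] for a in counts}
```

With a noise variance of 0.1 and 20000 samples per action, the standard deviation of each average is about 0.002, so utility differences on the order of the tolerance level become distinguishable even though individual measurements are much noisier.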

5.2.3.3  Comment  on  Length  and  Synchronization  of  Players’  Exploration  Phases 

In the proof of Theorem 5.2.3, we assumed that all players' exploration phases were synchronized and of the same length. This assumption was used to ensure that the baseline action of the other players remained constant when a player assessed the performance of a particular action. Because of the players' inertia this assumption is unnecessary. The general idea is as follows: a player will repeat his baseline action regardless of his better response set with positive probability because of his inertia. Therefore, if all players repeat their baseline action a sufficient number of times, which happens with positive probability, then the joint baseline action would remain constant long enough for any player to evaluate an accurate better response set for that particular joint baseline action.

5.3  Influencing  Nash  Equilibria  in  Resource  Allocation  Problems 

In  this  section  we  will  derive  an  approach  for  influencing  the  Nash  equilibria  of  a 
resource  allocation  problem  using  the  idea  of  marginal  cost  pricing.  We  will  illustrate 
the  setup  and  our  approach  on  a  congestion  game  which  is  an  example  of  a  resource 
allocation  problem. 

5.3.1  Congestion  Game  with  Tolls  Setup 

We consider a congestion game, as defined in Section 2.3.3, with a player set $\mathcal{P} = \{\mathcal{P}_1, \dots, \mathcal{P}_n\}$, a set of resources $\mathcal{R}$, and a congestion function $c_r : \{0, 1, 2, \dots\} \to \mathbb{R}$ for each resource $r \in \mathcal{R}$.

One approach for equilibrium manipulation is to influence drivers' utilities with tolls [San02], as introduced in Section 3.4.2. In a congestion game with tolls, a driver's utility takes on the form

$$U_i(a) = -\sum_{r \in a_i} \big( c_r(\sigma_r(a)) + t_r(\sigma_r(a)) \big),$$

where $t_r(k)$ is the toll imposed on route $r$ if there are $k$ users.

In Section 3.4.2, we analyzed the situation in which a global planner was interested in minimizing the total congestion experienced by all drivers on the network, which can be evaluated as

$$T_c(a) := \sum_{r \in \mathcal{R}} \sigma_r(a)\, c_r(\sigma_r(a)).$$

Now suppose that the global planner is interested in minimizing a more general measure³,

$$\phi(a) := \sum_{r \in \mathcal{R}} f_r(\sigma_r(a))\, c_r(\sigma_r(a)). \tag{5.5}$$

An example of an objective function that fits within this framework and may be practical for general resource allocation problems is

$$\phi(a) = \sum_{r \in \mathcal{R}} c_r(\sigma_r(a)).$$

We will now show that there exists a set of tolls, $t_r(\cdot)$, such that the potential function associated with the congestion game with tolls will be aligned with the global planner's objective function of the form given in equation (5.5).

Proposition 5.3.1. Consider a congestion game of any network topology. If the imposed tolls are set as

$$t_r(k) = (f_r(k) - 1)\, c_r(k) - f_r(k - 1)\, c_r(k - 1), \quad \forall k \ge 1,$$

then the global planner's objective, $\phi_c(a) = -\phi(a)$, is a potential function for the congestion game with tolls.

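Proof. Let $a^1 = \{a_i^1, a_{-i}\}$ and $a^2 = \{a_i^2, a_{-i}\}$. We will use the shorthand notation $\sigma_r^1$ to represent $\sigma_r(a^1)$. The change in utility incurred by driver $d_i$ in changing from route $a_i^2$ to route $a_i^1$ is

$$\begin{aligned}
U_i(a^1) - U_i(a^2) &= -\sum_{r \in a_i^1} \left( c_r(\sigma_r^1) + t_r(\sigma_r^1) \right) + \sum_{r \in a_i^2} \left( c_r(\sigma_r^2) + t_r(\sigma_r^2) \right), \\
&= -\sum_{r \in a_i^1 \setminus a_i^2} \left( c_r(\sigma_r^1) + t_r(\sigma_r^1) \right) + \sum_{r \in a_i^2 \setminus a_i^1} \left( c_r(\sigma_r^2) + t_r(\sigma_r^2) \right).
\end{aligned}$$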

3  In  fact,  if  cr(crr(a))  f  0  for  all  a,  then  (5.5)  is  equivalent  to  R  /r(crr(a))  where 
ffo’fa,))  = 
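The change in the total negative congestion from the joint action $a^2$ to $a^1$ is

$$\phi_c(a^1) - \phi_c(a^2) = -\sum_{r \in (a_i^1 \cup a_i^2)} \left( f_r(\sigma_r^1) c_r(\sigma_r^1) - f_r(\sigma_r^2) c_r(\sigma_r^2) \right).$$

Since

$$\sum_{r \in (a_i^1 \cap a_i^2)} \left( f_r(\sigma_r^1) c_r(\sigma_r^1) - f_r(\sigma_r^2) c_r(\sigma_r^2) \right) = 0,$$

the change in the total negative congestion is

$$\begin{aligned}
\phi_c(a^1) - \phi_c(a^2) = &-\sum_{r \in a_i^1 \setminus a_i^2} \left( f_r(\sigma_r^1) c_r(\sigma_r^1) - f_r(\sigma_r^2) c_r(\sigma_r^2) \right) \\
&-\sum_{r \in a_i^2 \setminus a_i^1} \left( f_r(\sigma_r^1) c_r(\sigma_r^1) - f_r(\sigma_r^2) c_r(\sigma_r^2) \right).
\end{aligned}$$

Expanding the first term, we obtain

$$\begin{aligned}
\sum_{r \in a_i^1 \setminus a_i^2} \left( f_r(\sigma_r^1) c_r(\sigma_r^1) - f_r(\sigma_r^2) c_r(\sigma_r^2) \right)
&= \sum_{r \in a_i^1 \setminus a_i^2} \left( f_r(\sigma_r^1) c_r(\sigma_r^1) - f_r(\sigma_r^1 - 1)\, c_r(\sigma_r^1 - 1) \right), \\
&= \sum_{r \in a_i^1 \setminus a_i^2} \left( f_r(\sigma_r^1) c_r(\sigma_r^1) - \left( (f_r(\sigma_r^1) - 1)\, c_r(\sigma_r^1) - t_r(\sigma_r^1) \right) \right), \\
&= \sum_{r \in a_i^1 \setminus a_i^2} \left( c_r(\sigma_r^1) + t_r(\sigma_r^1) \right).
\end{aligned}$$

Therefore,

$$\begin{aligned}
\phi_c(a^1) - \phi_c(a^2) &= -\sum_{r \in a_i^1 \setminus a_i^2} \left( c_r(\sigma_r^1) + t_r(\sigma_r^1) \right) + \sum_{r \in a_i^2 \setminus a_i^1} \left( c_r(\sigma_r^2) + t_r(\sigma_r^2) \right), \\
&= U_i(a^1) - U_i(a^2).
\end{aligned}$$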

□ 
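The potential property can also be verified by brute force on a small instance. The sketch below is a hedged illustration: the three resources, the four routes, and the integer-valued functions `c` and `f` are arbitrary choices of ours (integers keep the comparison exact), while `toll` implements the toll formula from Proposition 5.3.1.

```python
from itertools import product

RESOURCES = (0, 1, 2)
# Hypothetical route set shared by all drivers; each route is a set of resources.
ROUTES = (frozenset({0}), frozenset({1}), frozenset({0, 2}), frozenset({1, 2}))

def c(r, k):
    """Arbitrary integer congestion function c_r(k), chosen for illustration."""
    return (r + 1) * k * k + r

def f(r, k):
    """Arbitrary integer weighting f_r(k), chosen for illustration."""
    return k + 2 * r

def toll(r, k):
    """t_r(k) = (f_r(k) - 1) c_r(k) - f_r(k - 1) c_r(k - 1), for k >= 1."""
    return (f(r, k) - 1) * c(r, k) - f(r, k - 1) * c(r, k - 1)

def loads(a):
    """Number of drivers using each resource under joint action a."""
    return {r: sum(r in route for route in a) for r in RESOURCES}

def utility(i, a):
    """Driver i's utility: negative congestion plus tolls along his route."""
    s = loads(a)
    return -sum(c(r, s[r]) + toll(r, s[r]) for r in a[i])

def phi_c(a):
    """Global planner's objective, phi_c(a) = -phi(a)."""
    s = loads(a)
    return -sum(f(r, s[r]) * c(r, s[r]) for r in RESOURCES)

def is_exact_potential(n_players=3):
    """Check U_i(b) - U_i(a) == phi_c(b) - phi_c(a) for all unilateral deviations."""
    for a in product(ROUTES, repeat=n_players):
        for i in range(n_players):
            for alt in ROUTES:
                b = a[:i] + (alt,) + a[i + 1:]
                if utility(i, b) - utility(i, a) != phi_c(b) - phi_c(a):
                    return False
    return True
```

On resource 0 the weighting reduces to $f_0(k) = k$, so `toll(0, k)` equals $(k - 1)(c_0(k) - c_0(k - 1))$, which is the marginal cost form.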
By implementing the tolling scheme set forth in Proposition 5.3.1, we guarantee that all action profiles that minimize the global planner's objective are equilibria of the congestion game with tolls.

In the special case that $f_r(\sigma_r(a)) = \sigma_r(a)$, Proposition 5.3.1 produces the same tolls as in Proposition 3.4.1.

5.4  Illustrative  Example  -  Braess’  Paradox 

We  will  consider  a  discrete  representation  of  the  congestion  game  setup  considered  in 
Braess’  Paradox  [Bra68].  In  our  setting,  there  are  1000  vehicles  that  need  to  traverse 
through  the  network.  The  network  topology  and  associated  congestion  functions  are 
illustrated  in  Figure  5.4.  Each  vehicle  can  select  one  of  the  four  possible  paths  to 
traverse  across  the  network. 


Figure  5.4:  Congestion  Game  Setup  -  Braess’  Paradox 


The reason for using this setup as an illustration of the learning algorithms and equilibrium manipulation approach developed in this chapter is that the Nash equilibrium of this particular congestion game is easily identifiable. The unique Nash equilibrium is when all vehicles take the route as highlighted in Figure 5.5. At this Nash equilibrium each vehicle experiences a congestion of 2 (i.e., a utility of -2) and the total congestion is 2000.


Figure  5.5:  Illustration  of  Nash  Equilibrium  in  Braess’  Paradox. 


Since a potential game is weakly acyclic, the payoff based learning dynamics in this chapter are applicable learning algorithms for this congestion game. In a congestion game, a payoff based learning algorithm means that drivers have access only to the actual congestion experienced. Drivers are unaware of the congestion level on any alternative routes. Figure 5.6 shows the evolution of drivers on routes when using the Simple Experimentation dynamics. This simulation used an experimentation rate of $\epsilon = 0.25\%$. The colors on the plots are consistent with the colors of each route as indicated in Figure 5.4. One can observe that the vehicles' collective behavior does indeed approach that of the Nash equilibrium.

In this congestion game, it is also easy to verify that this vehicle distribution does not minimize the total congestion experienced by all drivers over the network. The distribution that minimizes the total congestion over the network is when half the vehicles occupy the top two roads and the other half occupy the bottom two roads. The middle road (pink) is irrelevant.


Figure  5.6:  Braess’  Paradox:  Evolution  of  Number  of  Vehicles  on  Each  Road  Using  Simple 
Experimentation  Dynamics 

One can employ the tolling scheme developed in the previous section to locally influence vehicle behavior to achieve this objective. In this setting, the new cost functions, i.e. congestion plus tolls, are illustrated in Figure 5.7.

Figure 5.7: Braess' Paradox: Congestion Game Setup with Tolls to Minimize Total Congestion

Figure 5.8 shows the evolution of drivers on routes when using the Simple Experimentation dynamics. This simulation used an experimentation rate of $\epsilon = 0.25\%$. When using this tolling scheme, the vehicles' collective behavior approaches the refined Nash equilibrium which now minimizes the total congestion experienced on the network. The total congestion experienced on the network is now approximately 1500.

There are other tolling schemes that would have resulted in the desired allocation. One approach is to assign an infinite cost to the middle road, which is equivalent to removing it from the network. Under this scenario, the unique Nash equilibrium is for half the vehicles to occupy the top route and half the bottom, which would minimize the total congestion on the network. Therefore, the existence of this extra road, even though it has zero cost, resulted in the unique Nash equilibrium having a higher total congestion. This is Braess' Paradox [Bra68].
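The two totals quoted above, 2000 at the Nash equilibrium and roughly 1500 at the optimum, can be reproduced with a short computation. Since Figure 5.4 is not reproduced in the text, the sketch below assumes the textbook Braess network: two load-dependent roads with congestion $k/1000$, two constant roads with congestion 1, and a free middle road traversable in either direction. The road labels, path names, and these congestion functions are our assumptions and may differ from the figure.

```python
N = 1000  # number of vehicles

# Assumed congestion functions (standard Braess network, not taken from Figure 5.4).
ROADS = {
    "A": lambda k: k / N,  # start -> top, load dependent
    "B": lambda k: 1.0,    # top -> end, constant
    "C": lambda k: 1.0,    # start -> bottom, constant
    "D": lambda k: k / N,  # bottom -> end, load dependent
    "M": lambda k: 0.0,    # middle road, free
}
PATHS = {
    "top": ("A", "B"),
    "bottom": ("C", "D"),
    "zigzag": ("A", "M", "D"),
    "reverse": ("C", "M", "B"),
}

def road_loads(path_counts):
    """Number of vehicles on each road given the number of vehicles per path."""
    loads = {r: 0 for r in ROADS}
    for path, count in path_counts.items():
        for r in PATHS[path]:
            loads[r] += count
    return loads

def path_cost(path, loads):
    """Congestion experienced by one vehicle along a path."""
    return sum(ROADS[r](loads[r]) for r in PATHS[path])

def total_congestion(path_counts):
    """Total congestion: sum over roads of load times per-vehicle congestion."""
    loads = road_loads(path_counts)
    return sum(loads[r] * ROADS[r](loads[r]) for r in ROADS)

def is_nash(path_counts):
    """True if no single vehicle can strictly lower its congestion by switching."""
    loads = road_loads(path_counts)
    for current, count in path_counts.items():
        if count == 0:
            continue
        current_cost = path_cost(current, loads)
        for other in PATHS:
            if other == current:
                continue
            new_loads = dict(loads)
            for r in PATHS[current]:
                new_loads[r] -= 1
            for r in PATHS[other]:
                new_loads[r] += 1
            if path_cost(other, new_loads) < current_cost - 1e-12:
                return False
    return True
```

Under these assumptions, `total_congestion({"zigzag": 1000})` evaluates to 2000 and that profile passes `is_nash`, while the even split `{"top": 500, "bottom": 500}` has total congestion 1500 but fails `is_nash`, because a single vehicle can cut through the free middle road.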

The advantage of the tolling scheme set forth in this chapter is that it gives a systematic method for influencing the Nash equilibria of any congestion game. We would like to highlight that this tolling scheme only guarantees that the action profiles that maximize the desired objective function are Nash equilibria of the new congestion game with tolls. However, it does not guarantee the absence of suboptimal Nash equilibria.

Figure 5.8: Braess' Paradox: Evolution of Number of Vehicles on Each Road Using Simple Experimentation Dynamics with Optimal Tolls

In many applications, players may not have access to their true utility, but do have access to a noisy measurement of their utility. For example, in the traffic setting, this noisy measurement could be the result of accidents or weather conditions. We will revisit the original congestion game (without tolls) as illustrated in Figure 5.4. We will now assume that a driver's utility measurement takes on the form

$$\tilde{U}_i(a) = -\sum_{r \in a_i} c_r(\sigma_r(a)) + \nu_i,$$

where $\nu_i$ is a random variable with zero mean and variance of 0.1. We will assume that the noise is driver specific rather than road specific.

Figure 5.9 shows a comparison of the evolution of drivers on routes when using the Simple and Sample Experimentation dynamics. The Simple Experimentation dynamics simulation used an experimentation rate $\epsilon = 0.25\%$. The Sample Experimentation dynamics simulation used an exploration rate $\epsilon = 0.25\%$, a tolerance level $\delta = 0.002$, an exploration phase length $m = 500000$, and inertia $\omega = 0.85$. As expected, the noisy utility measurements influenced vehicle behavior more in the Simple Experimentation dynamics than the Sample Experimentation dynamics.

Figure 5.9: Braess' Paradox: Comparison of Evolution of Number of Vehicles on Each Road Using Simple Experimentation Dynamics and Sample Experimentation Dynamics (baseline) with Noisy Utility Measurements (panels: Simple Experimentation Dynamics; Sample Experimentation Dynamics)


5.5 Concluding Remarks and Future Work

We have introduced Safe Experimentation dynamics for identical interest games, Simple Experimentation dynamics for weakly acyclic games with noise-free utility measurements, and Sample Experimentation dynamics for weakly acyclic games with noisy utility measurements. For all three settings, we have shown that for sufficiently large times, the joint action taken by players will constitute a Nash equilibrium. Furthermore, we have shown how to guarantee that an action profile optimizing a collective objective in a congestion game is a (non-unique) Nash equilibrium.

Our motivation has been that in many engineered systems, the functional forms of utility functions are not available, and so players must adjust their strategies through an adaptive process using only payoff measurements. In the dynamic processes defined here, there is no explicit cooperation or communication between players. On the one hand, this lack of explicit coordination offers an element of robustness to a variety of uncertainties in the strategy adjustment processes. Nonetheless, an interesting future direction would be to investigate to what degree explicit coordination through limited communications could be beneficial.
5.6  Appendix  to  Chapter  5


5.6.1  Background  on  Resistance  Trees 

For a detailed review of the theory of resistance trees, please see [You93]. Let $P^0$ denote the probability transition matrix for a finite state Markov chain over the state space $Z$. Consider a "perturbed" process such that the size of the perturbations can be indexed by a scalar $\epsilon > 0$, and let $P^\epsilon$ be the associated transition probability matrix. The process $P^\epsilon$ is called a regular perturbed Markov process if $P^\epsilon$ is ergodic for all sufficiently small $\epsilon > 0$ and $P^\epsilon$ approaches $P^0$ at an exponentially smooth rate [You93]. Specifically, the latter condition means that $\forall z, z' \in Z$,

$$\lim_{\epsilon \to 0^+} P^\epsilon_{zz'} = P^0_{zz'},$$

and

$$P^\epsilon_{zz'} > 0 \text{ for some } \epsilon > 0 \;\Rightarrow\; 0 < \lim_{\epsilon \to 0^+} \frac{P^\epsilon_{zz'}}{\epsilon^{r(z \to z')}} < \infty,$$

for some nonnegative real number $r(z \to z')$, which is called the resistance of the transition $z \to z'$. (Note in particular that if $P^0_{zz'} > 0$ then $r(z \to z') = 0$.)

Let the recurrence classes of $P^0$ be denoted by $E_1, E_2, \dots, E_N$. For each pair of distinct recurrence classes $E_i$ and $E_j$, $i \neq j$, an $ij$-path is defined to be a sequence of distinct states $\zeta = (z_1 \to z_2 \to \dots \to z_n)$ such that $z_1 \in E_i$ and $z_n \in E_j$. The resistance of this path is the sum of the resistances of its edges, that is, $r(\zeta) = r(z_1 \to z_2) + r(z_2 \to z_3) + \dots + r(z_{n-1} \to z_n)$. Let $\rho_{ij} = \min r(\zeta)$ be the least resistance over all $ij$-paths $\zeta$. Note that $\rho_{ij}$ must be positive for all distinct $i$ and $j$, because there exists no path of zero resistance between distinct recurrence classes.

Now construct a complete directed graph with $N$ vertices, one for each recurrence class. The vertex corresponding to class $E_j$ will be called $j$. The weight on the directed edge $i \to j$ is $\rho_{ij}$. A tree, $T$, rooted at vertex $j$, or $j$-tree, is a set of $N - 1$ directed edges such that, from every vertex different from $j$, there is a unique directed path in the tree to $j$. The resistance of a rooted tree, $T$, is the sum of the resistances $\rho_{ij}$ on the $N - 1$ edges that compose it. The stochastic potential, $\gamma_j$, of the recurrence class $E_j$ is defined to be the minimum resistance over all trees rooted at $j$. The following theorem gives a simple criterion for determining the stochastically stable states ([You93], Theorem 4).

Theorem 5.6.1. Let $P^\epsilon$ be a regular perturbed Markov process, and for each $\epsilon > 0$ let $\mu^\epsilon$ be the unique stationary distribution of $P^\epsilon$. Then $\lim_{\epsilon \to 0} \mu^\epsilon$ exists and the limiting distribution $\mu^0$ is a stationary distribution of $P^0$. The stochastically stable states (i.e., the support of $\mu^0$) are precisely those states contained in the recurrence classes with minimum stochastic potential.
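For a small number of recurrence classes, the stochastic potentials can be computed by direct enumeration of rooted trees. The sketch below is a brute-force illustration only: the function names and the resistance matrix in the usage note are hypothetical, and the enumeration is exponential in $N$, so it is meant for building intuition rather than for large problems.

```python
from itertools import product

def reaches_root(v, succ, root):
    """Follow outgoing edges from v; True if the walk arrives at the root."""
    seen = set()
    while v != root:
        if v in seen:
            return False  # trapped in a cycle that avoids the root
        seen.add(v)
        v = succ[v]
    return True

def stochastic_potentials(rho):
    """rho[i][j] is the least resistance from class i to class j (i != j).

    Returns, for every root j, the minimum total resistance over all j-trees.
    """
    n = len(rho)
    potentials = []
    for root in range(n):
        others = [v for v in range(n) if v != root]
        best = float("inf")
        # Every non-root vertex selects exactly one outgoing edge.
        for targets in product(range(n), repeat=len(others)):
            succ = dict(zip(others, targets))
            if any(v == w for v, w in succ.items()):
                continue  # self-loops are not edges of the graph
            if all(reaches_root(v, succ, root) for v in others):
                best = min(best, sum(rho[v][w] for v, w in succ.items()))
        potentials.append(best)
    return potentials
```

For the hypothetical resistances `rho = [[0, 1, 5], [3, 0, 1], [2, 4, 0]]`, the minimum-resistance tree rooted at class 2 uses the edges $0 \to 1$ and $1 \to 2$ for a total of 2, which is smaller than the potentials of the other two classes, so by Theorem 5.6.1 the stochastically stable states would be those in that class.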
CHAPTER 6


Connections Between Cooperative Control and Potential Games

In this chapter, we present a view of cooperative control using the language of learning in games. We review the game theoretic concepts of potential games and weakly acyclic games and demonstrate how several cooperative control problems such as consensus and dynamic sensor coverage can be formulated in these settings. Motivated by this connection, we build upon game theoretic concepts to better accommodate a broader class of cooperative control problems. In particular, we extend existing learning algorithms to accommodate restricted action sets caused by limitations in agent capabilities. Furthermore, we also introduce a new class of games, called sometimes weakly acyclic games, for time-varying objective functions and action sets, and provide distributed algorithms for convergence to an equilibrium. Lastly, we illustrate the potential benefits of this connection on several cooperative control problems. For the consensus problem, we demonstrate that consensus can be reached even in an environment with non-convex obstructions. For the functional consensus problem, we demonstrate an approach that will allow agents to reach consensus on a specific consensus point. For the dynamic sensor coverage problem, we demonstrate how autonomous sensors can distribute themselves using only local information in such a way as to maximize the probability of detecting an event over a given mission space. Lastly, we demonstrate how the popular mathematical game of Sudoku can be modeled as a potential game and solved using the learning algorithms discussed in this chapter.


6.1  Introduction 

Our goals in this chapter are to establish a relationship between cooperative control problems, such as the consensus problem, and game theoretic methods, and to demonstrate the effectiveness of utilizing game theoretic approaches for controlling multi-agent systems. The results presented here are of independent interest in terms of their applicability to a large class of games. However, we will focus on the consensus problem as the main illustration of the approach.

We consider a discrete time version of the consensus problem initiated in [TBA86] in which a group of players $\mathcal{P} = \{\mathcal{P}_1, \dots, \mathcal{P}_n\}$ seek to come to an agreement, or consensus, upon a common scalar value¹ by repeatedly interacting with one another. By reaching consensus, we mean converging to the agreement space characterized by

$$a_1 = a_2 = \dots = a_n,$$

where $a_i$ is referred to as the state of player $\mathcal{P}_i$. Several papers study different interaction models and analyze the conditions under which these interactions lead to consensus [BHO05, XB04, XB05, OM03, OFM07, Mor04, JLM03, KBS06].

A well studied protocol, referred to here as the "consensus algorithm", can be described as follows. At each time step $t \in \{0, 1, \dots\}$, each player $\mathcal{P}_i$ is allowed to interact with a group of other players, who are referred to as the neighbors of player $\mathcal{P}_i$ and denoted as $N_i(t)$. During an interaction, each player $\mathcal{P}_i$ is informed of the current (or possibly delayed) state of all his neighbors. Player $\mathcal{P}_i$ then updates his state by forming a convex combination of his state along with the state of all his neighbors.

1 The forthcoming results will hold for multi-dimensional consensus as well.

The consensus algorithm takes on the general form

$$a_i(t + 1) = \sum_{\mathcal{P}_j \in N_i(t)} \omega_{ij}(t)\, a_j(t), \tag{6.1}$$

where $\omega_{ij}(t)$ is the relative weight that player $\mathcal{P}_i$ places on the state of player $\mathcal{P}_j$ at time $t$. The interaction topology is described in terms of a time varying directed graph $G(V, E(t))$ with the set of nodes $V = \mathcal{P}$ and the set of edges $E(t) \subset V \times V$ at time $t$. The set of edges is directly related to the neighbor sets as follows: $(\mathcal{P}_i, \mathcal{P}_j) \in E(t)$ if and only if $\mathcal{P}_i \in N_j(t)$. We will refer to $G(V, E(t))$ as the interaction graph at time $t$.
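The update rule of the consensus algorithm is easy to simulate. The following sketch is a minimal illustration under assumptions of our own: the weights are time invariant, each player includes himself in the convex combination, and the weights are stored as a row-stochastic matrix whose entry `weights[i][j]` is zero whenever player $j$ is not a neighbor of player $i$.

```python
def consensus_step(states, weights):
    """One update: each player forms a convex combination of the current states."""
    n = len(states)
    return [sum(weights[i][j] * states[j] for j in range(n)) for i in range(n)]

def run_consensus(states, weights, steps):
    """Apply the consensus update a fixed number of times."""
    for _ in range(steps):
        states = consensus_step(states, weights)
    return states
```

As a usage example, consider three players on a line graph with weights `[[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]` and initial states `[0.0, 3.0, 9.0]`. All states converge to the common value 3.75, a weighted average of the initial states, and every intermediate state stays between the smallest and largest initial values because each update is a convex combination.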

There has been extensive research centered on understanding the conditions necessary for guaranteeing the convergence of all states, i.e. $\lim_{t \to \infty} a_i(t) = a^*$, for all players $\mathcal{P}_i \in \mathcal{P}$. The convergence properties of the consensus algorithm have been studied under several interaction models encompassing delays in information exchange, connectivity issues, varying topologies and noisy measurements.

Surprisingly, there has been relatively little research that links cooperative control problems to a branch of the learning in games literature [You98] that emphasizes coordination games. The goal of this chapter is to better establish this link and to develop new algorithms for broader classes of cooperative control problems as well as games.

In Section 6.2 we establish a connection between cooperative control problems and potential games. In Section 6.3 we model the consensus problem as a potential game and present suitable learning algorithms that guarantee that players will come to a consensus even in an environment filled with non-convex obstructions. In Section 6.4 we introduce a new class of games called sometimes weakly acyclic games, which generalize potential games, and present simple learning dynamics with desirable convergence properties. In Section 6.5 we show that the consensus problem can be modeled as a sometimes weakly acyclic game. In Section 6.6 we develop learning algorithms that can accommodate group based decisions. In Section 6.7 we model the functional consensus problem as a potential game with group based decisions. In Section 6.8 we illustrate the connection between cooperative control and potential games on the dynamic sensor allocation problem and also the mathematical puzzle of Sudoku. Section 6.9 presents some final remarks.

6.2  Cooperative  Control  Problems  and  Potential  Games 

Cooperative control problems entail several autonomous players seeking to collectively
accomplish a global objective. The consensus problem is one example of a
cooperative control problem, where the global objective is for all players to reach consensus
upon a given state. The challenge in cooperative control problems is designing
local control laws and/or local objective functions for each of the individual players so
that collectively they accomplish the desired global objective.

One approach for cooperative control problems is to assign each individual player
a fixed protocol or policy. This protocol specifies precisely what each player should
do under any environmental condition. The consensus algorithm set forth in Equation
(6.1) is an example of such a policy based approach. One challenge in this approach
is to incorporate dynamic or evolving constraints on player policies. For example,
suppose a global planner desires a group of autonomous players to physically converge
to a central location in an environment containing obstructions. The standard
consensus algorithm may not be applicable to this problem since limitations in control
capabilities caused by environmental obstructions are not considered. Variations of the
consensus algorithm could possibly be designed to accommodate obstructions, but the
analysis and control design would be more challenging.

An alternative, game theoretic approach to cooperative control problems, and our
main interest in this chapter, is to assign each individual player a local objective function.
In this setting, each player $\mathcal{P}_i \in \mathcal{P}$ is assigned an action set $\mathcal{A}_i$ and a local
objective function $U_i : \mathcal{A} \to \mathbb{R}$, where $\mathcal{A} = \prod_{\mathcal{P}_i \in \mathcal{P}} \mathcal{A}_i$ is the set of joint actions. An
example of an objective function that will be studied in the following section is

\[ U_i(a_i, a_{-i}) = -\sum_{\mathcal{P}_j \in \mathcal{N}_i} \|a_i - a_j\|, \]

where $\|\cdot\|$ is any norm, $\mathcal{N}_i$ is the neighbor set of player $\mathcal{P}_i$, and
$a_{-i} = \{a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n\}$ denotes the collection of actions of players other
than player $\mathcal{P}_i$. With this notation, we will frequently express the joint action $a$ as
$(a_i, a_{-i})$.

We are interested in analyzing the long term behavior when players are repeatedly
allowed to interact with one another in a competitive environment where each player
seeks to selfishly maximize his own objective function. These interactions will be
modeled as a repeated game, in which a one stage game is repeated each time step
$t \in \{0, 1, 2, \ldots\}$. At every time step $t > 0$, each player $\mathcal{P}_i \in \mathcal{P}$ selects an action $a_i \in \mathcal{A}_i$
seeking to myopically maximize his expected utility. Since a player's utility may be
adversely affected by the actions of other players, the player can use his observations
from the games played at times $\{0, 1, \ldots, t-1\}$ to develop a behavioral model of the
other players.

At any time $t > 0$, the learning dynamics specify how any player $\mathcal{P}_i$ processes past
observations from the interactions at times $\{0, 1, \ldots, t-1\}$ to generate a model of the
behavior of the other players. The learning dynamics that will be used throughout this
chapter are referred to as single stage memory dynamics, which have a structural form
similar to that of the consensus algorithm; namely, the decision of any player $\mathcal{P}_i$
at time $t$ is made using only observations from the game played at time $t-1$. The
learning dynamics need not be restricted to single stage memory. A follow up study
could analyze the benefit of using additional memory in learning dynamics for the
consensus problem.

The challenge of the control design for a game theoretic approach lies in designing
the objective functions and the learning dynamics such that, when players selfishly
pursue their own objectives, they also collectively accomplish the objective of
the global planner. Suppose that the objective of the global planner is captured by a
potential function $\phi : \mathcal{A} \to \mathbb{R}$. In any successful multi-agent system each player's
objective function should be appropriately "aligned" with the objective of the global
planner. This notion of utility alignment in multi-agent systems has a strong connection
to potential games [MS96b]. For convenience, we will restate the definition of
potential games originally defined in Section 2.3.2.

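Definition 6.2.1 (Potential Games). Player action sets $\{\mathcal{A}_i\}_{i=1}^n$ together with player
objective functions $\{U_i : \mathcal{A} \to \mathbb{R}\}_{i=1}^n$ constitute a potential game if, for some potential
function $\phi : \mathcal{A} \to \mathbb{R}$,

\[ U_i(a_i', a_{-i}) - U_i(a_i'', a_{-i}) = \phi(a_i', a_{-i}) - \phi(a_i'', a_{-i}) \]

for every player $\mathcal{P}_i \in \mathcal{P}$, for every $a_i', a_i'' \in \mathcal{A}_i$, and for every $a_{-i} \in \times_{j \neq i} \mathcal{A}_j$.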

A potential game, as defined above, requires perfect alignment between the global
objective and the players' local objective functions, meaning that if a player unilaterally
changed his action, the change in his objective function would be equal to the
change in the potential function. There are weaker notions of potential games, called
weakly acyclic games, which will be discussed later. The connection between cooperative
control problems and potential games is important because learning algorithms
for potential games have been studied extensively in the game theory literature
[MS96a, MS96b, MS97, MAS07b, MAS05]. Accordingly, if it is shown that a cooperative
control problem can be modeled as a potential game, established learning
algorithms with guaranteed asymptotic results could be used to tackle the cooperative
control problem at hand.

In the following section we will illustrate this opportunity by showing that the
consensus problem can be modeled as a potential game by defining players' utilities
appropriately.

6.3  Consensus  Modeled  as  a  Potential  Game 

In this section we will formulate the consensus problem as a potential game. First,
we establish a global objective function that captures the notion of consensus. Next,
we show that local objective functions can be assigned to each player so that the resulting
game is in fact a potential game. Finally, we present a learning algorithm that
guarantees consensus even in an environment containing non-convex obstructions.

It  turns  out  that  the  potential  game  formulation  of  the  consensus  problem  discussed 
in  this  section  requires  the  interaction  graph  to  be  time-invariant  and  undirected.  In 
Section  6.5  we  relax  these  requirements  by  formulating  the  consensus  problem  as  a 
sometimes  weakly  acyclic  game. 

6.3.1 Setup: Consensus Problem with a Time-Invariant and Undirected Interaction Graph

Consider a consensus problem with $n$-player set $\mathcal{P}$ where each player $\mathcal{P}_i \in \mathcal{P}$ has a
finite action set $\mathcal{A}_i$. A player's action set could represent the finite set of locations that
a player could select.

We will consider the following potential function for the consensus problem

\[ \phi(a) := -\sum_{\mathcal{P}_i \in \mathcal{P}} \sum_{\mathcal{P}_j \in \mathcal{N}_i} \frac{\|a_i - a_j\|}{2}, \tag{6.2} \]

where $\mathcal{N}_i \subset \mathcal{P}$ is player $\mathcal{P}_i$'s time-invariant neighbor set. In the case where the interaction
graph induced by the neighbor sets $\{\mathcal{N}_i\}_{i=1}^n$ is connected², the potential function
above achieves the value of 0 if and only if the action profile $a \in \mathcal{A}$ constitutes a
consensus, i.e.,

\[ \phi(a) = 0 \iff a_1 = \cdots = a_n. \]

The goal is to assign each player an objective function that is perfectly aligned
with the global objective in (6.2). One approach would be to assign each player the
following objective function:

\[ U_i(a) = \phi(a). \]

This assignment would require each player to observe the decisions of all players in
order to evaluate his payoff for a particular action choice, which may be infeasible. An
alternative approach would be to assign each player an objective function that captures
the player's marginal contribution to the potential function. For the consensus problem,
this translates to each player being assigned the objective function

\[ U_i(a_i, a_{-i}) := -\sum_{\mathcal{P}_j \in \mathcal{N}_i} \|a_i - a_j\|. \tag{6.3} \]

Now, each player's objective function is only dependent on the actions of his neighbors.
An objective function of this form is referred to as Wonderful Life Utility; see
[AMS07, WT99]. It is known that assigning each agent a Wonderful Life Utility leads
to a potential game [AMS07, WT99]; however, we will explicitly show this for the
consensus problem in the following claim.

Claim 6.3.1. Player objective functions (6.3) constitute a potential game with the potential
function (6.2) provided that the time-invariant interaction graph induced by the
neighbor sets $\{\mathcal{N}_i\}_{i=1}^n$ is undirected, i.e.,

\[ \mathcal{P}_j \in \mathcal{N}_i \iff \mathcal{P}_i \in \mathcal{N}_j. \]

² A graph is connected if there exists a path from any node to any other node.
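
Proof. Since the interaction graph is time-invariant and undirected, the potential function
can be expressed as

\[ \phi(a) = -\sum_{\mathcal{P}_j \in \mathcal{N}_i} \|a_i - a_j\| - \sum_{\mathcal{P}_j \neq \mathcal{P}_i} \sum_{\mathcal{P}_k \in \mathcal{N}_j \setminus \{\mathcal{P}_i\}} \frac{\|a_j - a_k\|}{2}. \]

The change in objective of player $\mathcal{P}_i$ by switching from action $a_i^1$ to action $a_i^2$, provided
that all other players collectively play $a_{-i}$, is

\[ U_i(a_i^2, a_{-i}) - U_i(a_i^1, a_{-i}) = \sum_{\mathcal{P}_j \in \mathcal{N}_i} \left( -\|a_i^2 - a_j\| + \|a_i^1 - a_j\| \right) = \phi(a_i^2, a_{-i}) - \phi(a_i^1, a_{-i}). \]

□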

Note that the above claim does not require the interaction graph to be connected. There
may exist other potential functions and subsequent player objective functions that can
accommodate more general setups. For a detailed discussion on possible player objective
functions derived from a given potential function, see [AMS07].
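The potential game condition of Definition 6.2.1 can be checked by brute force on a small instance. The sketch below is illustrative only: the three-player graphs, the scalar action set, and the use of the absolute value as the norm are assumptions chosen to keep the enumeration tiny.

```python
from itertools import product

def utility(i, a, neighbors):
    """Objective (6.3) with scalar actions and the absolute value as the norm."""
    return -sum(abs(a[i] - a[j]) for j in neighbors[i])

def potential(a, neighbors):
    """Potential (6.2): half the sum over ordered neighbor pairs."""
    return -0.5 * sum(abs(a[i] - a[j]) for i in neighbors for j in neighbors[i])

def is_potential_game(neighbors, actions):
    """Check every unilateral deviation: utility change == potential change."""
    n = len(neighbors)
    for a in product(actions, repeat=n):
        for i in range(n):
            for alt in actions:
                b = a[:i] + (alt,) + a[i + 1:]
                du = utility(i, b, neighbors) - utility(i, a, neighbors)
                dphi = potential(b, neighbors) - potential(a, neighbors)
                if abs(du - dphi) > 1e-12:
                    return False
    return True

undirected = {0: [1], 1: [0, 2], 2: [1]}   # hypothetical path P1 - P2 - P3
directed = {0: [1], 1: [], 2: [1]}         # P2 observes nobody: not undirected
```

On these two hypothetical graphs the check passes for the undirected case and fails for the directed one, which is consistent with the role the undirectedness assumption plays in the proof.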

We now assume that the above game is repeatedly played at discrete time steps
$t \in \{0, 1, 2, \ldots\}$. We are interested in determining the limiting behavior of the players,
in particular whether or not they reach a consensus, under various interaction models.
Since the consensus problem is modeled as a potential game, there are a large number
of learning algorithms available with guaranteed results [You98, You05, AMS07,
MS96b, MAS07b, MAS05]. Most of the learning algorithms for potential games guarantee
that the player behavior converges to a Nash equilibrium.

It is straightforward to see that any consensus point is a Nash equilibrium of the
game characterized by the player objective functions (6.3). This is because a consensus
point maximizes the potential function as well as the player objective functions (6.3).
However, the converse statement is not true. Let $\mathcal{A}^*$ denote the set of Nash equilibria
and $\mathcal{A}^c$ denote the set of consensus points. We know that $\mathcal{A}^c \subset \mathcal{A}^*$ where the inclusion
can be proper. In other words, a Nash equilibrium, say $a^* \in \mathcal{A}^*$, can be suboptimal,
i.e., $\phi(a^*) < 0$, and hence fail to be a consensus point.
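A small enumeration makes the gap between Nash equilibria and consensus points concrete. The instance below (four players on a path graph, binary scalar actions, absolute value as the norm) is a hypothetical example chosen for illustration, not one taken from the text.

```python
from itertools import product

NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # path P1 - P2 - P3 - P4
ACTIONS = [0, 1]

def utility(i, a):
    """Objective (6.3) for scalar actions."""
    return -sum(abs(a[i] - a[j]) for j in NEIGHBORS[i])

def is_nash(a):
    """True if no player can strictly improve by a unilateral deviation."""
    for i in range(len(a)):
        for alt in ACTIONS:
            b = a[:i] + (alt,) + a[i + 1:]
            if utility(i, b) > utility(i, a):
                return False
    return True

nash = [a for a in product(ACTIONS, repeat=4) if is_nash(a)]
consensus = [a for a in nash if len(set(a)) == 1]
```

In this instance the profile (0, 0, 1, 1) is a Nash equilibrium, because each of the two middle players sees one neighbor at each value and cannot strictly improve, yet it is not a consensus point.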

6.3.2 A Learning Algorithm for Potential Games with Suboptimal Nash Equilibria

Before stating the learning algorithm, we start with some notation. Let the strategy
of player $\mathcal{P}_i$ at time $t$ be denoted by the probability distribution $p_i(t) \in \Delta(\mathcal{A}_i)$ where
$\Delta(\mathcal{A}_i)$ denotes the set of probability distributions over the set $\mathcal{A}_i$. Using this strategy,
player $\mathcal{P}_i$ randomly selects an action from $\mathcal{A}_i$ at time $t$ according to $p_i(t)$.

Consider the following learning algorithm known as spatial adaptive play (SAP)
[You98]. At each time $t > 0$, one player $\mathcal{P}_i \in \mathcal{P}$ is randomly chosen (with equal
probability for each player) and allowed to update his action. All other players must
repeat their actions, i.e., $a_{-i}(t) = a_{-i}(t-1)$. At time $t$, the updating player $\mathcal{P}_i$
randomly selects an action from $\mathcal{A}_i$ according to his strategy $p_i(t) \in \Delta(\mathcal{A}_i)$ where the
$a_i$-th component $p_i^{a_i}(t)$ of his strategy is given as

\[ p_i^{a_i}(t) = \frac{\exp\{\beta\, U_i(a_i, a_{-i}(t-1))\}}{\sum_{\bar{a}_i \in \mathcal{A}_i} \exp\{\beta\, U_i(\bar{a}_i, a_{-i}(t-1))\}}, \]

for some exploration parameter $\beta \geq 0$. The constant $\beta$ determines how likely player $\mathcal{P}_i$
is to select a suboptimal action. If $\beta = 0$, player $\mathcal{P}_i$ will select any action $a_i \in \mathcal{A}_i$ with
equal probability. As $\beta \to \infty$, player $\mathcal{P}_i$ will select an action from his best response
set

\[ \{a_i \in \mathcal{A}_i : U_i(a_i, a_{-i}(t-1)) = \max_{a_i' \in \mathcal{A}_i} U_i(a_i', a_{-i}(t-1))\} \]

with arbitrarily high probability.
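The SAP update can be sketched in a few lines. The function names, the tuple representation of joint actions, and the `utility(i, a)` callback signature below are assumptions introduced for this illustration; the probabilities follow the exponential weighting described above.

```python
import math
import random

def sap_probabilities(utilities, beta):
    """Softmax over one player's utilities, one entry per action in his set."""
    shift = max(utilities)                  # subtract the max to avoid overflow
    weights = [math.exp(beta * (u - shift)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def sap_step(a, action_sets, utility, beta, rng=random):
    """One SAP iteration: a single uniformly chosen player resamples his action."""
    i = rng.randrange(len(a))
    candidates = [a[:i] + (x,) + a[i + 1:] for x in action_sets[i]]
    probs = sap_probabilities([utility(i, b) for b in candidates], beta)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

Subtracting the largest utility before exponentiating leaves the probabilities unchanged and keeps the computation stable when the exploration parameter is large.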

In a repeated potential game in which all players adhere to SAP, the stationary
distribution $\mu \in \Delta(\mathcal{A})$ of the joint action profiles is given in [You98] as

\[ \mu(a) = \frac{\exp\{\beta\, \phi(a)\}}{\sum_{\bar{a} \in \mathcal{A}} \exp\{\beta\, \phi(\bar{a})\}}. \]

One can interpret the stationary distribution $\mu$ as follows: for sufficiently large times
$t > 0$, $\mu(a)$ equals the probability that $a(t) = a$. As $\beta \uparrow \infty$, all the weight of the
stationary distribution $\mu$ is on the joint actions that maximize the potential function.
In the potential game formulation of the consensus problem, the joint actions that
maximize the potential function (6.2) are precisely the consensus points provided that
the interaction graph is connected. Therefore, if all players update their actions using
the learning algorithm SAP with sufficiently large $\beta$, then the players will reach a
consensus asymptotically with arbitrarily high probability.
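The concentration of the stationary distribution on potential maximizers can be seen numerically. The sketch below evaluates the exponential-weight distribution on a hypothetical three-player path graph with scalar actions; the graph and action set are assumptions chosen so that all 27 joint actions can be enumerated.

```python
import math
from itertools import product

NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}    # hypothetical connected path graph
ACTIONS = [0, 1, 2]                        # hypothetical scalar positions

def potential(a):
    """Potential (6.2) with the absolute value as the norm."""
    return -0.5 * sum(abs(a[i] - a[j]) for i in NEIGHBORS for j in NEIGHBORS[i])

def stationary_distribution(beta):
    """Probability of each joint action, proportional to exp(beta * potential)."""
    profiles = list(product(ACTIONS, repeat=len(NEIGHBORS)))
    weights = [math.exp(beta * potential(a)) for a in profiles]
    total = sum(weights)
    return {a: w / total for a, w in zip(profiles, weights)}

def consensus_mass(beta):
    """Total stationary probability placed on consensus profiles."""
    mu = stationary_distribution(beta)
    return sum(p for a, p in mu.items() if len(set(a)) == 1)
```

At an exploration parameter of zero the consensus profiles carry only their uniform share (3 of 27 joint actions); the share grows monotonically toward one as the parameter increases.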

6.3.3 A Learning Algorithm for Potential Games with Suboptimal Nash Equilibria and Restricted Action Sets

One issue with the applicability of the learning algorithm SAP for the consensus problem
is that it permits any player to select any action in his action set. Because of player
mobility limitations, this may not be possible. For example, a player may only be able
to move to a position within a fixed radius of his current position. Therefore, we seek
to modify SAP by conditioning a player's action set on his previous action. Let $a(t-1)$
be the joint action at time $t-1$. With restricted action sets, the set of actions available
to player $\mathcal{P}_i$ at time $t$ is a function of his action at time $t-1$ and will be denoted as
$R_i(a_i(t-1)) \subset \mathcal{A}_i$. We will adopt the convention that $a_i \in R_i(a_i)$ for any action
$a_i \in \mathcal{A}_i$, i.e., a player is always allowed to stay with his previous action.

We will introduce a variant of SAP called binary Restrictive Spatial Adaptive Play
(RSAP) to accommodate the notion of restricted action sets. RSAP can be described
as follows: At each time step $t > 0$, one player $\mathcal{P}_i \in \mathcal{P}$ is randomly chosen (with equal
probability for each player) and allowed to update his action. All other players must
repeat their actions, i.e., $a_{-i}(t) = a_{-i}(t-1)$. At time $t$, the updating player $\mathcal{P}_i$ selects
one trial action $\hat{a}_i$ randomly from his allowable set $R_i(a_i(t-1))$ with the following
probability:

• $\Pr[\hat{a}_i = a_i] = \frac{1}{N_i}$ for any $a_i \in R_i(a_i(t-1)) \setminus a_i(t-1)$,

• $\Pr[\hat{a}_i = a_i(t-1)] = 1 - \frac{|R_i(a_i(t-1))| - 1}{N_i}$,

where $N_i$ denotes the maximum number of actions in any restricted action set for
player $\mathcal{P}_i$, i.e., $N_i := \max_{a_i \in \mathcal{A}_i} |R_i(a_i)|$. After player $\mathcal{P}_i$ selects a trial action $\hat{a}_i$, he
chooses his action at time $t$ as follows:

\[ \Pr[a_i(t) = \hat{a}_i] = \frac{\exp\{\beta\, U_i(\hat{a}_i, a_{-i}(t-1))\}}{\exp\{\beta\, U_i(\hat{a}_i, a_{-i}(t-1))\} + \exp\{\beta\, U_i(a(t-1))\}}, \]

\[ \Pr[a_i(t) = a_i(t-1)] = \frac{\exp\{\beta\, U_i(a(t-1))\}}{\exp\{\beta\, U_i(\hat{a}_i, a_{-i}(t-1))\} + \exp\{\beta\, U_i(a(t-1))\}}, \]

where $\beta \geq 0$ is an exploration parameter. Note that if $\hat{a}_i$ is selected as $a_i(t-1)$ then
$\Pr[a_i(t) = a_i(t-1)] = 1$.
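The two-stage structure of RSAP (draw a trial action, then make a binary choice between the trial and the current action) can be sketched as follows. The function name, the `restricted(i, current)` and `utility(i, a)` callback signatures, and the tuple representation of joint actions are assumptions made for this illustration; the caller is expected to pick the updating player uniformly at random.

```python
import math
import random

def rsap_step(a, i, restricted, n_max, utility, beta, rng=random):
    """Update of player i under RSAP; returns the new joint action as a tuple.

    restricted(i, current) lists the actions reachable from `current`
    (including `current` itself); n_max is the largest size of any such set.
    """
    current = a[i]
    others = [x for x in restricted(i, current) if x != current]
    # Trial action: each other reachable action w.p. 1/n_max, else stay put.
    k = int(rng.random() * n_max)
    trial = others[k] if k < len(others) else current
    if trial == current:
        return a
    b = a[:i] + (trial,) + a[i + 1:]
    # Binary choice: exp(beta*U(b)) / (exp(beta*U(b)) + exp(beta*U(a))),
    # written in a form that cannot overflow.
    d = beta * (utility(i, b) - utility(i, a))
    if d >= 0:
        p_accept = 1.0 / (1.0 + math.exp(-d))
    else:
        e = math.exp(d)
        p_accept = e / (1.0 + e)
    return b if rng.random() < p_accept else a
```

Padding the trial draw with the current action is what keeps the proposal probability equal to one over the maximum set size regardless of how many actions are actually reachable, which is the property the detailed balance argument relies on.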


We make the following assumptions regarding the restricted action sets.

Assumption 6.3.1 (Reversibility). For any player $\mathcal{P}_i \in \mathcal{P}$ and any action pair $a_i^1, a_i^2 \in \mathcal{A}_i$,

\[ a_i^2 \in R_i(a_i^1) \iff a_i^1 \in R_i(a_i^2). \]

Assumption 6.3.2 (Feasibility). For any player $\mathcal{P}_i \in \mathcal{P}$ and any action pair $a_i^0, a_i^n \in \mathcal{A}_i$,
there exists a sequence of actions $a_i^0 \to a_i^1 \to \cdots \to a_i^n$ satisfying $a_i^k \in R_i(a_i^{k-1})$
for all $k \in \{1, 2, \ldots, n\}$.

Theorem 6.3.1. Consider a finite $n$-player potential game with potential function $\phi$.
If the restricted action sets satisfy Assumptions 6.3.1 and 6.3.2, then RSAP induces
a Markov process over the state space $\mathcal{A}$ where the unique stationary distribution
$\mu \in \Delta(\mathcal{A})$ is given as

\[ \mu(a) = \frac{\exp\{\beta\, \phi(a)\}}{\sum_{\bar{a} \in \mathcal{A}} \exp\{\beta\, \phi(\bar{a})\}}, \quad \text{for any } a \in \mathcal{A}. \tag{6.4} \]


Proof. The proof follows along the lines of the proof of Theorem 6.2 in [You98]. By
Assumptions 6.3.1 and 6.3.2 we know that the Markov process induced by RSAP is
irreducible and aperiodic; therefore, the process has a unique stationary distribution.
Below, we show that this unique distribution must be (6.4) by verifying that the distribution
(6.4) satisfies the detailed balance equations

\[ \mu(a) P_{ab} = \mu(b) P_{ba}, \]

for any $a, b \in \mathcal{A}$, where

\[ P_{ab} := \Pr[a(t) = b \mid a(t-1) = a]. \]

Note that the only nontrivial case is the one where $a$ and $b$ differ by exactly one player
$\mathcal{P}_i$, that is, $a_{-i} = b_{-i}$ but $a_i \neq b_i$ where $a_i \in R_i(b_i)$, which also implies that $b_i \in R_i(a_i)$.
Since player $\mathcal{P}_i$ has probability $1/n$ of being chosen in any given period and any trial
action $b_i \in R_i(a_i)$, $b_i \neq a_i$, has probability $1/N_i$ of being chosen, it follows that

\[ \mu(a) P_{ab} = \left[ \frac{\exp\{\beta\, \phi(a)\}}{\sum_{z \in \mathcal{A}} \exp\{\beta\, \phi(z)\}} \right] \times \left[ (1/n)(1/N_i) \frac{\exp\{\beta\, U_i(b)\}}{\exp\{\beta\, U_i(a)\} + \exp\{\beta\, U_i(b)\}} \right]. \]

Letting

\[ \lambda = \left( \frac{1}{\sum_{z \in \mathcal{A}} \exp\{\beta\, \phi(z)\}} \right) \times \left( \frac{(1/n)(1/N_i)}{\exp\{\beta\, U_i(a)\} + \exp\{\beta\, U_i(b)\}} \right), \]

we obtain

\[ \mu(a) P_{ab} = \lambda \exp\{\beta\, \phi(a) + \beta\, U_i(b)\}. \]

Since $U_i(b) - U_i(a) = \phi(b) - \phi(a)$, we have

\[ \mu(a) P_{ab} = \lambda \exp\{\beta\, \phi(b) + \beta\, U_i(a)\}, \]

which leads us to

\[ \mu(a) P_{ab} = \mu(b) P_{ba}. \]

□

Note that if all players adhere to the learning dynamics RSAP in a consensus problem
where the interaction graph is time-invariant and undirected, the restricted action
sets satisfy Assumptions 6.3.1 and 6.3.2, and players are assigned the utilities (6.3),
then, at sufficiently large times $t$, the players' collective behavior will maximize the
potential function (6.2) with arbitrarily high probability provided that $\beta$ is sufficiently
large. Furthermore, if the interaction graph is connected and consensus is possible,
meaning $(\mathcal{A}_1 \cap \mathcal{A}_2 \cap \cdots \cap \mathcal{A}_n) \neq \emptyset$, then, at sufficiently large times $t > 0$, the
players' actions will constitute a consensus with arbitrarily high probability even in an
environment filled with non-convex obstructions.
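The stationarity claim of Theorem 6.3.1 can be sanity-checked numerically on a tiny instance by building the RSAP transition matrix explicitly. Everything specific in the sketch below (two players, three scalar positions, moves of at most one step, absolute value as the norm) is an assumption chosen so that the nine joint actions can be enumerated.

```python
import math
from itertools import product
import numpy as np

ACTIONS = [0, 1, 2]                 # hypothetical 1-D positions
NEIGHBORS = {0: [1], 1: [0]}        # two players joined by an undirected edge
N_MAX = 3                           # largest restricted action set (from position 1)

def restricted(x):
    """Reachable positions: stay or move one step (reversible and feasible)."""
    return [y for y in ACTIONS if abs(y - x) <= 1]

def utility(i, a):
    return -sum(abs(a[i] - a[j]) for j in NEIGHBORS[i])

def potential(a):
    return -0.5 * sum(abs(a[i] - a[j]) for i in NEIGHBORS for j in NEIGHBORS[i])

def rsap_matrix(beta):
    """Transition matrix of RSAP over all joint actions."""
    profiles = list(product(ACTIONS, repeat=2))
    index = {a: k for k, a in enumerate(profiles)}
    P = np.zeros((len(profiles), len(profiles)))
    for a in profiles:
        for i in range(2):
            for x in restricted(a[i]):
                if x == a[i]:
                    continue
                b = a[:i] + (x,) + a[i + 1:]
                wa = math.exp(beta * utility(i, a))
                wb = math.exp(beta * utility(i, b))
                P[index[a], index[b]] += (1 / 2) * (1 / N_MAX) * wb / (wa + wb)
        P[index[a], index[a]] = 1.0 - P[index[a]].sum()   # remaining mass: stay
    return profiles, P

def gibbs(profiles, beta):
    """Candidate stationary distribution (6.4)."""
    w = np.array([math.exp(beta * potential(a)) for a in profiles])
    return w / w.sum()
```

For this instance the candidate distribution satisfies the fixed-point equation with the transition matrix to numerical precision, which is what detailed balance implies.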

6.3.4  Example:  Consensus  in  an  Environment  with  Non-convex  Obstructions 

Consider the 2-D consensus problem with player set $\mathcal{P} = \{\mathcal{P}_1, \mathcal{P}_2, \mathcal{P}_3, \mathcal{P}_4\}$. Each
player $\mathcal{P}_i$ has an action set $\mathcal{A}_i = \{1, 2, \ldots, 10\} \times \{1, 2, \ldots, 10\}$ as illustrated in Figure
6.1. The arrows represent the time-invariant and undirected edges of the connected
interaction  graph.  The  restricted  action  sets  are  highlighted  for  players  V2  and  V\ .  At 
any given time, any player can have at most 9 possible actions; therefore, $N_i = 9$ for
all players $\mathcal{P}_i \in \mathcal{P}$.

We simulated RSAP on the consensus problem with the interaction graph, environmental
obstruction, and the initial conditions shown in Figure 6.1. We increase the
exploration parameter $\beta$ as $t/200$ during player interactions. The complete action path
of all players reaching a consensus is shown in Figure 6.2.

Figure 6.1: Example: Setup of a Consensus Problem with Restricted Action Sets and
Non-convex Environmental Obstructions.

6.4  Weakly  Acyclic  and  Sometimes  Weakly  Acyclic  Games 

In potential games, players' objective functions must be perfectly aligned with the potential
of the game. In the potential game formulation of the consensus problem, this
alignment condition required that the interaction graph be time-invariant and undirected.
In this section we will seek to relax this alignment requirement by allowing
player objective functions to be "somewhat" aligned with the potential of the game.
We will review a weaker form of potential games called weakly acyclic games and
introduce a new class of games called sometimes weakly acyclic games. We will also
present simple learning dynamics that guarantee convergence to a universal Nash equilibrium,
to be defined later, in any sometimes weakly acyclic game.

Figure 6.2: Example: Evolution of the Action Path in the Consensus Problem with Restricted
Action Sets and Non-convex Environmental Obstructions.

6.4.1  Weakly  Acyclic  Games 

Recall the definition of a weakly acyclic game from Section 2.3.4. A game is weakly
acyclic if, for any $a \in \mathcal{A}$, there exists a better reply path starting at $a$ and ending at
some Nash equilibrium [You98, You05].

The  above  definition  does  not  clearly  identify  the  similarities  between  potential 
games  and  weakly  acyclic  games.  Furthermore,  using  this  definition  to  show  that  a 
given  game  G,  i.e.,  the  players,  objective  functions,  and  action  sets,  is  weakly  acyclic 
would  be  problematic.  With  these  issues  in  mind,  we  will  now  derive  an  equivalent 
definition  for  weakly  acyclic  games  that  utilizes  potential  functions. 

Lemma 6.4.1. A game is weakly acyclic if and only if there exists a potential function
$\phi : \mathcal{A} \to \mathbb{R}$ such that for any action $a \in \mathcal{A}$ that is not a Nash equilibrium, there
exists a player $\mathcal{P}_i \in \mathcal{P}$ with an action $a_i^* \in \mathcal{A}_i$ such that $U_i(a_i^*, a_{-i}) > U_i(a_i, a_{-i})$ and
$\phi(a_i^*, a_{-i}) > \phi(a_i, a_{-i})$.

Proof. ($\Leftarrow$) Select any action $a^0 \in \mathcal{A}$. If $a^0$ is not a Nash equilibrium, there exists a
player $\mathcal{P}_i \in \mathcal{P}$ with an action $a_i^* \in \mathcal{A}_i$ such that $U_i(a^1) > U_i(a^0)$ and $\phi(a^1) > \phi(a^0)$
where $a^1 = (a_i^*, a_{-i}^0)$.

Repeat this process and construct a path $a^0, a^1, \ldots, a^n$ until it can no longer be
extended. Note first that such a path cannot cycle back on itself, because $\phi$ is strictly
increasing along the path. Since $\mathcal{A}$ is finite, the path cannot be extended indefinitely.
Hence, the last element in this path must be a Nash equilibrium.

($\Rightarrow$) We will construct a potential function $\phi : \mathcal{A} \to \mathbb{R}$ recursively. Select any
action $a^0 \in \mathcal{A}$. Since the game is weakly acyclic, there exists a better reply path
$a^0, a^1, \ldots, a^n$ where $a^n$ is a Nash equilibrium. Let $\mathcal{A}^0 = \{a^0, a^1, \ldots, a^n\}$. Define the
(finite) potential function $\phi$ over the set $\mathcal{A}^0$ satisfying the following conditions:

\[ \phi(a^0) < \phi(a^1) < \cdots < \phi(a^n). \]

Now select any action $\tilde{a}^0 \in \mathcal{A} \setminus \mathcal{A}^0$. There exists a better reply path $\tilde{a}^0, \tilde{a}^1, \ldots, \tilde{a}^m$
where $\tilde{a}^m$ is a Nash equilibrium. Let $\mathcal{A}^1 = \{\tilde{a}^0, \tilde{a}^1, \ldots, \tilde{a}^m\}$. If $\mathcal{A}^1 \cap \mathcal{A}^0 = \emptyset$ then
define the potential function $\phi$ over the set $\mathcal{A}^1$ satisfying the following conditions:

\[ \phi(\tilde{a}^0) < \phi(\tilde{a}^1) < \cdots < \phi(\tilde{a}^m). \]

If $\mathcal{A}^1 \cap \mathcal{A}^0 \neq \emptyset$, then let $k^* = \min\{k \in \{1, 2, \ldots, m\} : \tilde{a}^k \in \mathcal{A}^0\}$. Define the potential
function $\phi$ over the truncated (redefined) set $\mathcal{A}^1 = \{\tilde{a}^0, \tilde{a}^1, \ldots, \tilde{a}^{k^*-1}\}$ satisfying the
following conditions:

\[ \phi(\tilde{a}^0) < \phi(\tilde{a}^1) < \cdots < \phi(\tilde{a}^{k^*}). \]

Now select any action $\hat{a}^0 \in \mathcal{A} \setminus (\mathcal{A}^0 \cup \mathcal{A}^1)$ and repeat until no such action exists.

The construction of the potential function $\phi$ guarantees that for any action $a \in \mathcal{A}$
that is not a Nash equilibrium, there exists a player $\mathcal{P}_i \in \mathcal{P}$ with an action $a_i^* \in \mathcal{A}_i$
such that $U_i(a_i^*, a_{-i}) > U_i(a_i, a_{-i})$ and $\phi(a_i^*, a_{-i}) > \phi(a_i, a_{-i})$. □


6.4.2  Learning  Dynamics  for  Weakly  Acyclic  Games 

We will consider the better reply with inertia dynamics for weakly acyclic games analyzed
in [You93, You05]. Before stating the learning dynamics, we define a player's
strict better reply set for any action profile $a^0 \in \mathcal{A}$ as

\[ B_i(a^0) := \{a_i \in \mathcal{A}_i : U_i(a_i, a_{-i}^0) > U_i(a^0)\}. \]

The better reply with inertia dynamics can be described as follows. At each time
$t > 0$, each player $\mathcal{P}_i$ presumes that all other players will continue to play their previous
actions $a_{-i}(t-1)$. Under this presumption, each player $\mathcal{P}_i \in \mathcal{P}$ selects an action
according to the following strategy at time $t$:

\[ B_i(a(t-1)) = \emptyset \;\Rightarrow\; a_i(t) = a_i(t-1), \]

\[ B_i(a(t-1)) \neq \emptyset \;\Rightarrow\; \begin{cases} \Pr[a_i(t) = a_i(t-1)] = \alpha_i(t), \\ \Pr[a_i(t) = a_i^*] = \dfrac{1 - \alpha_i(t)}{|B_i(a(t-1))|}, \end{cases} \]

for any action $a_i^* \in B_i(a(t-1))$, where $\alpha_i(t) \in (0, 1)$ is referred to as the player's inertia
at time $t$. According to these rules, player $\mathcal{P}_i$ will stay with the previous action $a_i(t-1)$
with probability $\alpha_i(t)$ even when there is a perceived opportunity for improvement. We
make the following standing assumption on the players' willingness to optimize.

Assumption 6.4.1. There exist constants $\underline{\epsilon}$ and $\bar{\epsilon}$ such that for all time $t > 0$ and for
all players $\mathcal{P}_i \in \mathcal{P}$,

\[ 0 < \underline{\epsilon} < \alpha_i(t) < \bar{\epsilon} < 1. \]

This assumption implies that players are always willing to optimize with some
nonzero inertia.

If all players adhere to the better reply with inertia dynamics satisfying Assumption
6.4.1, then the joint action profiles will converge to a Nash equilibrium almost
surely in any weakly acyclic game [You93, You05].
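The better reply with inertia rule can be sketched as a simultaneous update of all players. The function name, the `utility(i, a)` callback, and the use of a single constant inertia value for every player and time step are simplifying assumptions of this illustration.

```python
import random

def better_reply_step(a, action_sets, utility, inertia, rng=random):
    """All players update at once, each reasoning from the previous profile a."""
    new_actions = []
    for i, current in enumerate(a):
        base = utility(i, a)
        # Strict better reply set, holding the other players' actions fixed.
        better = [x for x in action_sets[i]
                  if utility(i, a[:i] + (x,) + a[i + 1:]) > base]
        if not better or rng.random() < inertia:
            new_actions.append(current)             # no improvement, or inertia
        else:
            new_actions.append(rng.choice(better))  # uniform over better replies
    return tuple(new_actions)
```

Because every player's better reply set is empty at a Nash equilibrium, the update leaves such a profile unchanged, which is why equilibria are absorbing under these dynamics.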

6.4.3  Sometimes  Weakly  Acyclic  Games 

In the potential game formulation of the consensus problem, each player was assigned
a time-invariant objective function of the form (6.3). However, in the case of a time-varying
interaction topology, we would like to allow player objective functions to be
time-varying. In this framework, each player $\mathcal{P}_i$ is now assigned a local objective
function $U_i : \mathcal{A} \times \{0, 1, 2, \ldots\} \to \mathbb{R}$. We will denote the objective function of player
$\mathcal{P}_i$ at time $t$ as $U_i(a(t), t)$ where $a(t)$ is the action profile at time $t$.

We will call an action profile $a^*$ a universal Nash equilibrium if, for every player $\mathcal{P}_i \in \mathcal{P}$,

\[ U_i(a^*, t) = \max_{a_i \in \mathcal{A}_i} U_i(a_i, a_{-i}^*, t) \]

for all times $t \geq 0$.

We will call a game sometimes weakly acyclic if there exists a potential function
$\phi : \mathcal{A} \to \mathbb{R}$ and a finite time constant $T$ such that for any time $t_0 > 0$ and any action
profile $a^0$ that is not a universal Nash equilibrium, there exists a time $t_1 \in [t_0, t_0 + T]$,
a player $\mathcal{P}_i \in \mathcal{P}$, and an action $a_i^* \in \mathcal{A}_i$ such that $U_i(a_i^*, a_{-i}^0, t_1) > U_i(a^0, t_1)$ and
$\phi(a_i^*, a_{-i}^0) > \phi(a^0)$.

Note that a sometimes weakly acyclic game has at least one universal Nash equilibrium:
namely, an action profile that maximizes the potential function $\phi$.

6.4.4  Learning  Dynamics  for  Sometimes  Weakly  Acyclic  Games 

We will consider the better reply with inertia dynamics for games involving time-varying
objective functions. Before stating the learning dynamics, we redefine a player's
strict better reply set for any action profile $a^0 \in \mathcal{A}$ and time $t > 0$ as

\[ B_i(a^0, t) := \{a_i \in \mathcal{A}_i : U_i(a_i, a_{-i}^0, t) > U_i(a^0, t)\}. \]

The better reply with inertia dynamics can be described as follows. At each time $t > 0$,
each player $\mathcal{P}_i$ presumes that all other players will continue to play their previous
actions $a_{-i}(t-1)$. Under this presumption, each player $\mathcal{P}_i \in \mathcal{P}$ selects an action
according to the following strategy at time $t$:

\[ B_i(a(t-1), t) = \emptyset \;\Rightarrow\; a_i(t) = a_i(t-1), \]

\[ B_i(a(t-1), t) \neq \emptyset \;\Rightarrow\; \begin{cases} \Pr[a_i(t) = a_i(t-1)] = \alpha_i(t), \\ \Pr[a_i(t) = a_i^*] = \dfrac{1 - \alpha_i(t)}{|B_i(a(t-1), t)|}, \end{cases} \]

for any action $a_i^* \in B_i(a(t-1), t)$, where $\alpha_i(t) \in (0, 1)$ is the player's inertia at time $t$.


Theorem 6.4.1. Consider an $n$-player sometimes weakly acyclic game with finite action
sets. If all players adhere to the better reply with inertia dynamics satisfying
Assumption 6.4.1, then the joint action profiles will converge to a universal Nash equilibrium
almost surely.


Proof. Let $\phi : \mathcal{A} \to \mathbb{R}$ and $T$ be the potential function and time constant for the
sometimes weakly acyclic game. Let $a(t_0) = a^0$ be the action profile at time $t_0$. If
$a^0$ is a universal Nash equilibrium, then $a(t) = a^0$ for all times $t \geq t_0$ and we are
done. Otherwise, there exists a time $t_1$ satisfying $(t_0 + T) \geq t_1 \geq t_0$, a player $\mathcal{P}_i \in \mathcal{P}$,
and an action $a_i^* \in \mathcal{A}_i$ such that $U_i(a_i^*, a_{-i}^0, t_1) > U_i(a^0, t_1)$ and $\phi(a_i^*, a_{-i}^0) > \phi(a^0)$.
Because of players' inertia, the action $a^1 = (a_i^*, a_{-i}^0)$ will be played at time $t_1$ with at
least probability $\underline{\epsilon}^{n-1} \frac{(1 - \bar{\epsilon})}{|\mathcal{A}|} \underline{\epsilon}^{nT}$.

One can repeat this argument to show that for any time $t_0 > 0$ and any action profile $a(t_0)$ there exists a joint action $a^*$ such that

$$\Pr[a(t) = a^*,\ \forall t \ge t^*] \ge \epsilon^*,$$

where

$$t^* = t_0 + T|\mathcal{A}|, \qquad \epsilon^* = \left(\epsilon^{n-1} \frac{\epsilon}{|\mathcal{A}|} \epsilon^{nT}\right)^{|\mathcal{A}|}.$$

□


6.5  Consensus  Modeled  as  a  Sometimes  Weakly  Acyclic  Game 

Two  main  problems  arose  in  the  potential  game  formulation  of  the  consensus  problem. 
The  first  problem  was  that  a  Nash  equilibrium  was  not  necessarily  a  consensus  point 
even  when  the  interaction  graph  was  connected  and  the  environment  was  obstruction 
free.  Therefore,  we  needed  to  employ  a  stochastic  learning  algorithm  like  SAP  or 
RSAP  to  guarantee  that  the  collective  behavior  of  the  players  would  be  a  consensus 
point  with  arbitrarily  high  probability.  SAP  or  RSAP  led  to  consensus  by  introducing 
noise  into  the  decision  making  process,  meaning  that  a  player  would  occasionally  make 
a  suboptimal  choice.  The  second  problem  was  that  the  interaction  graph  needed  to  be 
time-invariant,  undirected,  and  connected  in  order  to  guarantee  consensus. 

In  this  section,  we  will  illustrate  that  by  modeling  the  consensus  problem  as  a 
sometimes  weakly  acyclic  game  one  can  effectively  alleviate  both  problems.  For 
brevity, we will show that the 1-dimensional consensus problem with appropriately
designed  player  objective  functions  is  a  sometimes  weakly  acyclic  game.  However, 
one  can  easily  extend  this  to  the  multi-dimensional  case. 
6.5.1  Setup: Consensus Problem with a Time-Varying and Directed Interaction Graph

Consider a consensus problem with an n-player set $\mathcal{P}$ and a time-varying and directed interaction graph. Each player has a finite action set $\mathcal{A}_i \subset \mathbb{R}$ and, without loss of generality, we will assume that $\mathcal{A}_1 = \mathcal{A}_2 = \cdots = \mathcal{A}_n$. Each player $\mathcal{P}_i \in \mathcal{P}$ is assigned an objective function $U_i : \mathcal{A} \times \{0, 1, 2, \ldots\} \to \mathbb{R}$. We make the following standard assumption on players' neighbor sets.

Assumption 6.5.1. Players' neighbor sets satisfy

$$\mathcal{P}_i \in N_i(t), \quad \forall\, \mathcal{P}_i \in \mathcal{P},\ t \ge 0.$$

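Before introducing the player objective functions, we define the following measure

$$D(a, \mathcal{P}') := \max_{\mathcal{P}_i, \mathcal{P}_j \in \mathcal{P}'} (a_i - a_j), \qquad (6.5)$$

where $\mathcal{P}' \subseteq \mathcal{P}$, and extreme player sets

$$\mathcal{P}^u(a) := \{\mathcal{P}_i \in \mathcal{P} : a_i = \max_{\mathcal{P}_j \in \mathcal{P}} a_j\},$$

$$\mathcal{P}^l(a) := \{\mathcal{P}_i \in \mathcal{P} : a_i = \min_{\mathcal{P}_j \in \mathcal{P}} a_j\},$$

$$n(a) := \min\{|\mathcal{P}^u(a)|, |\mathcal{P}^l(a)|\}.$$

We also define the constant $\delta_A > 0$ as follows. For any $a^1, a^2 \in \mathcal{A}$ and any player sets $\mathcal{P}^1, \mathcal{P}^2 \subseteq \mathcal{P}$ such that $D(a^1, \mathcal{P}^1) \ne D(a^2, \mathcal{P}^2)$, the following inequality is satisfied:

$$|D(a^1, \mathcal{P}^1) - D(a^2, \mathcal{P}^2)| > \delta_A.$$

Consider the following potential function $\phi : \mathcal{A} \to \mathbb{R}$

$$\phi(a) = -D(a, \mathcal{P}) + \delta_A (1 - n(a)/n). \qquad (6.6)$$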
Note that the potential function is a non-positive function that achieves the value of 0 if and only if the action profile constitutes a consensus. Furthermore, note that the potential function is independent of the interaction topology.
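To make the potential in (6.6) concrete, the following Python sketch evaluates it for scalar action profiles. It is illustrative only: the function names are hypothetical, and it assumes integer-valued actions so that distinct values of the measure differ by at least 1 and any `delta` below 1 (here 0.5) satisfies the defining inequality for $\delta_A$.

```python
def spread(profile, players):
    """D(a, P'): largest pairwise difference among the players in P'."""
    values = [profile[i] for i in players]
    return max(values) - min(values)

def extreme_count(profile):
    """n(a): size of the smaller of the two extreme player sets."""
    upper = sum(1 for x in profile if x == max(profile))
    lower = sum(1 for x in profile if x == min(profile))
    return min(upper, lower)

def potential(profile, delta):
    """phi(a) = -D(a, P) + delta * (1 - n(a)/n), as in (6.6)."""
    n = len(profile)
    return -spread(profile, range(n)) + delta * (1 - extreme_count(profile) / n)
```

For example, moving one player of the profile `(0, 0, 2, 2)` from 0 to 1 leaves the spread unchanged but shrinks the smaller extreme set, so `potential` increases; shrinking the spread increases it by more than `delta`.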

Rather than specifying a particular objective function as in (6.3), we will introduce a class of admissible objective functions. To that end, we define the set of reasonable actions for player $\mathcal{P}_i$ at time $t$ given any action profile $a^0 \in \mathcal{A}$ as

$$S_i(a^0, t) := \{a_i \in \mathcal{A}_i : \max_{\mathcal{P}_j \in N_i(t)} a_j^0 \ge a_i \ge \min_{\mathcal{P}_k \in N_i(t)} a_k^0\}.$$

Note that

$$a_i \in S_i(a^0, t) \ \Rightarrow\ D((a_i, a_{-i}^0), N_i(t)) \le D(a^0, N_i(t)).$$

We will define a general class of reasonable objective functions. An objective function for player $\mathcal{P}_i$ is called a reasonable objective function if, for any time $t > 0$, and any action profile $a \in \mathcal{A}$, the better response set satisfies the following two conditions:

1.  $B_i(a, t) \subset \{S_i(a, t) \cup \emptyset\}$,

2.  $|S_i(a, t)| > 1 \ \Rightarrow\ B_i(a, t) \ne \emptyset$.

Roughly  speaking,  these  conditions  ensure  that  a  player  will  not  value  moving  further 
away  from  his  belief  about  the  location  of  his  neighbors. 

We  will  now  relax  our  requirements  on  the  connectivity  and  time-invariance  of  the 
interaction  graph  in  the  consensus  problem.  A  common  assumption  on  the  interaction 
graph  is  connectedness  over  intervals. 

Assumption 6.5.2 (Connectedness Over Intervals). There exists a constant $T > 0$ such that for any time $t > 0$, the interaction graph with nodes $\mathcal{P}$ and edges $E = E(t) \cup \cdots \cup E(t + T)$ is connected.
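The window condition in Assumption 6.5.2 can be checked numerically over a finite horizon, as in the sketch below. Several things here are assumptions rather than part of the text: the helper names, the `edges_at(t)` callback returning a list of `(u, v)` pairs, the finite `horizon` (the assumption itself quantifies over all times), and the reading of "connected" as connectivity of the union graph with edge directions ignored.

```python
def union_is_connected(nodes, edge_sets):
    """True if the union of the edge sets links all nodes (edges read as undirected)."""
    neighbors = {v: set() for v in nodes}
    for edges in edge_sets:
        for u, v in edges:
            neighbors[u].add(v)
            neighbors[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for w in neighbors[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(neighbors)

def connected_over_intervals(nodes, edges_at, T, horizon):
    """Check the union of E(t), ..., E(t+T) for every t = 0, ..., horizon - T."""
    return all(
        union_is_connected(nodes, [edges_at(s) for s in range(t, t + T + 1)])
        for t in range(horizon - T + 1)
    )
```

A graph that alternates between two disconnected edge sets can still pass this check with $T = 1$, which is the situation the assumption is meant to admit.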

Claim 6.5.1. Reasonable objective functions introduced above constitute a sometimes weakly acyclic game with the potential function (6.6) provided that the interaction graph satisfies Assumption 6.5.2. Furthermore, every universal Nash equilibrium constitutes consensus.


Proof.  It  is  easy  to  see  that  any  consensus  point  is  a  universal  Nash  equilibrium.  We 
will  show  that  if  an  action  profile  is  not  a  consensus  point,  then  there  exists  a  player 
who  can  increase  his  objective  function  as  well  as  the  potential  function  at  some  time  in 
a  fixed  time  window.  This  implies  that  every  universal  Nash  equilibrium  is  a  consensus 
point and furthermore that the game is sometimes weakly acyclic.

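Let $a^0 = a(t_0)$ be any joint action that is not a consensus point. We will show that for some time $t_1 \in [t_0, t_0 + T]$ there exists a player $\mathcal{P}_i \in \mathcal{P}$ with an action $a_i^* \in \mathcal{A}_i$ such that $U_i(a_i^*, a_{-i}^0, t_1) > U_i(a^0, t_1)$ and $\phi(a_i^*, a_{-i}^0) > \phi(a^0)$. To see this, let $\mathcal{P}^*$ be the extreme player set with the least number of players, i.e., $\mathcal{P}^* = \mathcal{P}^u(a^0)$ if $|\mathcal{P}^u(a^0)| \le |\mathcal{P}^l(a^0)|$ or $\mathcal{P}^* = \mathcal{P}^l(a^0)$ if $|\mathcal{P}^u(a^0)| > |\mathcal{P}^l(a^0)|$. Since the interaction graph satisfies Assumption 6.5.2 (see footnote 3), for some $t_1 \in [t_0, t_0 + T]$ there exists at least one player $\mathcal{P}_i \in \mathcal{P}^*$ with a neighbor $\mathcal{P}_j \in N_i(t_1)$ such that $\mathcal{P}_j \notin \mathcal{P}^*$. Therefore,

$$|S_i(a^0, t_1)| > 1 \ \Rightarrow\ B_i(a^0, t_1) \ne \emptyset.$$

Let $a_i^* \in B_i(a^0, t_1)$ and for notational convenience let $a^1 = (a_i^*, a_{-i}^0)$. We know that $D(a^1, \mathcal{P}) \le D(a^0, \mathcal{P})$. If $D(a^1, \mathcal{P}) < D(a^0, \mathcal{P})$, then

$$\begin{aligned}
\phi(a^1) &= -D(a^1, \mathcal{P}) + \delta_A (1 - n(a^1)/n), \\
&> -D(a^0, \mathcal{P}) + \delta_A + \delta_A (1 - n(a^1)/n), \\
&\ge -D(a^0, \mathcal{P}) + \delta_A + \delta_A (1 - (n(a^0) + n)/n), \\
&= \phi(a^0).
\end{aligned}$$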

Footnote 3: Note that Assumption 6.5.2 is stronger than necessary for this proof.

If $D(a^1, \mathcal{P}) = D(a^0, \mathcal{P})$, then

$$\begin{aligned}
\phi(a^1) &= -D(a^0, \mathcal{P}) + \delta_A (1 - n(a^1)/n), \\
&> -D(a^0, \mathcal{P}) + \delta_A (1 - (n(a^1) + 1)/n), \\
&\ge -D(a^0, \mathcal{P}) + \delta_A (1 - n(a^0)/n), \\
&= \phi(a^0).
\end{aligned}$$

Therefore, $a^0$ is not a universal Nash equilibrium. □

If all players adhere to the better reply with inertia dynamics in a consensus problem where the interaction graph satisfies Assumption 6.5.2 and the players are assigned reasonable objective functions, then the joint action profile will converge almost surely to a consensus point.

These  results  can  easily  be  extended  to  a  multi-dimensional  consensus  problem 
with  bounded  observational  delays. 

6.5.2  Extension  to  Multi-Dimensional  Consensus 

One can easily extend the arguments above to show that any $k$-dimensional consensus game is a sometimes weakly acyclic game by generalizing the measure and choosing the extreme player sets appropriately. An example of an acceptable measure is

$$D(a, \mathcal{P}') := \sum_{k=1}^{n} \max_{\mathcal{P}_i, \mathcal{P}_j \in \mathcal{P}'} d_k^T (a_i - a_j),$$

where $a_i, a_j \in \mathbb{R}^k$ and $d_1, d_2, \ldots, d_n \in \mathbb{R}^k$ is a set of measure vectors which span the complete space of $\mathbb{R}^k$. The term $\max_{\mathcal{P}_i, \mathcal{P}_j \in \mathcal{P}'} d_k^T (a_i - a_j)$ captures the maximum distance between the action of any two agents in the nonempty player set $\mathcal{P}'$ relative to the measure direction $d_k$. In the 1-D consensus problem, where $d_1 = 1$, the measure reverts back to (6.5).
The set of reasonable actions for player $\mathcal{P}_i$ at time $t$ given the joint action profile $a$ is now

$$S_i(a, t) = \{a_i' \in \mathcal{A}_i : \forall k,\ \max_{\mathcal{P}_j \in N_i(t)} d_k^T a_j \ge d_k^T a_i' \ge \min_{\mathcal{P}_j \in N_i(t)} d_k^T a_j\}.$$

The consensus algorithm in (6.1) corresponds to a specific reasonable utility function. In particular, the set of reasonable actions is the convex hull of the previous actions of his neighbors, i.e.,

$$S_i(a, t) = \Big\{a_i' \in \mathcal{A}_i : a_i' = \sum_{\mathcal{P}_j \in N_i(t)} \omega_{ij} a_j,\ \sum_{\mathcal{P}_j \in N_i(t)} \omega_{ij} = 1,\ \omega_{ij} \ge 0\ \forall\, \mathcal{P}_j \in N_i(t)\Big\}.$$

In  the  present  setting,  a  player’s  future  action  need  not  be  in  the  convex  hull  of  his 
neighbors’  actions. 

6.6  Group  Based  Decision  Processes  for  Potential  Games 

In  this  section,  we  analyze  the  situation  where  players  are  allowed  to  collaborate  with 
a  group  of  other  players  when  making  a  decision.  In  particular  we  extend  the  results 
of  SAP  to  accommodate  such  a  grouping  structure. 

6.6.1  Spatial  Adaptive  Play  with  Group  Based  Decisions 

Consider a potential game with potential function $\phi : \mathcal{A} \to \mathbb{R}$. We will now introduce a variant of SAP to accommodate group based decisions. At each time $t > 0$, a group of players $G \subseteq \mathcal{P}$ is randomly chosen according to a fixed probability distribution $P \in \Delta(2^{\mathcal{P}})$, where $2^{\mathcal{P}}$ denotes the set of subsets of $\mathcal{P}$. We will refer to $P_G$ as the probability that group $G$ will be chosen. We make the following assumption on the group probability distribution.

Assumption 6.6.1 (Completeness). For any player $\mathcal{P}_i \in \mathcal{P}$ there exists a group $G \subseteq \mathcal{P}$ such that $\mathcal{P}_i \in G$ and $P_G > 0$.
Once a group is selected, the group is unilaterally allowed to alter its collective strategy. All players not in the group must repeat their action, i.e., $a_{-G}(t) = a_{-G}(t-1)$, where $a_G$ is the action-tuple of all players in the group $G$, and $a_{-G}$ is the action-tuple of all players not in the group $G$. The group will be modeled as a single entity with a group utility function $U_G : \mathcal{A} \to \mathbb{R}$ and a collective action set $\mathcal{A}_G = \prod_{\mathcal{P}_i \in G} \mathcal{A}_i$. At time $t$, the updating group $G$ randomly selects a collective action from $\mathcal{A}_G$ according to the collective strategy $p_G(t) \in \Delta(\mathcal{A}_G)$, where the $\bar{a}_G$-th component $p_G^{\bar{a}_G}(t)$ of the collective strategy is given as

$$p_G^{\bar{a}_G}(t) = \frac{\exp\{\beta\, U_G(\bar{a}_G, a_{-G}(t-1))\}}{\sum_{\tilde{a}_G \in \mathcal{A}_G} \exp\{\beta\, U_G(\tilde{a}_G, a_{-G}(t-1))\}},$$

for some exploration parameter $\beta \ge 0$.
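One step of SAP with group based decisions can be sketched as follows. This is a minimal sketch, not the author's code: the function names and the `group_utility(group, profile)` callback are hypothetical, groups are represented as tuples of player indices, and the maximum score is subtracted before exponentiating purely for numerical stability (it does not change the distribution).

```python
import itertools
import math
import random

def collective_strategy(group, profile, action_sets, group_utility, beta):
    """Boltzmann distribution over the group's collective actions."""
    candidates = list(itertools.product(*(action_sets[i] for i in group)))
    scores = []
    for collective in candidates:
        trial = list(profile)
        for i, a_i in zip(group, collective):
            trial[i] = a_i
        scores.append(beta * group_utility(group, tuple(trial)))
    top = max(scores)  # shift scores so the largest exponent is zero
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return candidates, [w / total for w in weights]

def group_sap_step(profile, groups, group_probs, action_sets, group_utility, beta, rng):
    """Draw a group from P, then a collective action; non-members repeat."""
    group = rng.choices(groups, weights=group_probs)[0]
    candidates, probs = collective_strategy(group, profile, action_sets, group_utility, beta)
    collective = rng.choices(candidates, weights=probs)[0]
    new_profile = list(profile)
    for i, a_i in zip(group, collective):
        new_profile[i] = a_i
    return tuple(new_profile)
```

With `beta = 0` the group picks uniformly among its collective actions; as `beta` grows, the probability mass concentrates on collective actions that maximize the group utility.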

We  make  the  following  assumption  on  the  admissible  group  utility  functions: 

Assumption 6.6.2 (Group Utility Functions). Group utility functions must preserve the potential structure of the game, meaning that for any group $G \subseteq \mathcal{P}$, collective group actions $a_G', a_G'' \in \mathcal{A}_G$, and $a_{-G} \in \prod_{\mathcal{P}_i \notin G} \mathcal{A}_i$,

$$U_G(a_G', a_{-G}) - U_G(a_G'', a_{-G}) = \phi(a_G', a_{-G}) - \phi(a_G'', a_{-G}).$$


In  general,  group  utility  functions  need  to  preserve  this  condition.  However,  one 
can  always  assign  each  group  a  utility  that  captures  the  group’s  marginal  contribution 
to  the  potential  function,  i.e.,  a  wonderful  life  utility  as  discussed  in  Section  6.3.  This 
utility  assignment  guarantees  preservation  of  the  potential  structure  of  the  game. 

We  will  now  show  that  the  convergence  properties  of  the  learning  algorithm  SAP 
still  hold  with  group  based  decisions. 

Theorem 6.6.1. Consider a finite n-player potential game with potential function $\phi(\cdot)$, a group probability distribution $P$ satisfying Assumption 6.6.1, and group utility functions satisfying Assumption 6.6.2. SAP with group based decisions induces a Markov process over the state space $\mathcal{A}$ where the unique stationary distribution $\mu \in \Delta(\mathcal{A})$ is given as

$$\mu(a) = \frac{\exp\{\beta\, \phi(a)\}}{\sum_{\bar{a} \in \mathcal{A}} \exp\{\beta\, \phi(\bar{a})\}}, \quad \text{for any } a \in \mathcal{A}. \qquad (6.7)$$


Proof. The proof follows along the lines of the proof of Theorem 6.2 in [You98]. By Assumption 6.6.1, the Markov process induced by SAP with group based decisions is irreducible and aperiodic; therefore, the process has a unique stationary distribution. Below, we show that this unique distribution must be (6.7) by verifying that the distribution (6.7) satisfies the detailed balance equations

$$\mu(a) P_{ab} = \mu(b) P_{ba},$$

for any $a, b \in \mathcal{A}$, where

$$P_{ab} := \Pr[a(t) = b \mid a(t-1) = a].$$

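Note that there are now several ways to transition from $a$ to $b$ when incorporating group based decisions. Let $G(a, b)$ represent the group of players with different actions in $a$ and $b$, i.e.,

$$G(a, b) := \{\mathcal{P}_i \in \mathcal{P} : a_i \ne b_i\}.$$

Let $\mathcal{G}(a, b) \subseteq 2^{\mathcal{P}}$ be the complete set of player groups for which the transition from $a$ to $b$ is possible, i.e.,

$$\mathcal{G}(a, b) := \{G \in 2^{\mathcal{P}} : G(a, b) \subseteq G\}.$$

Since a group $G \in \mathcal{G}(a, b)$ has probability $P_G$ of being chosen in any given period, it follows that

$$\mu(a) P_{ab} = \left[\frac{\exp\{\beta\, \phi(a)\}}{\sum_{z \in \mathcal{A}} \exp\{\beta\, \phi(z)\}}\right] \left[\sum_{G \in \mathcal{G}(a, b)} P_G\, \frac{\exp\{\beta\, U_G(b)\}}{\sum_{\bar{a}_G \in \mathcal{A}_G} \exp\{\beta\, U_G(\bar{a}_G, a_{-G})\}}\right].$$

Letting

$$\lambda_G := \left(\frac{1}{\sum_{z \in \mathcal{A}} \exp\{\beta\, \phi(z)\}}\right) \times \left(\frac{P_G}{\sum_{\bar{a}_G \in \mathcal{A}_G} \exp\{\beta\, U_G(\bar{a}_G, a_{-G})\}}\right),$$

we obtain

$$\mu(a) P_{ab} = \sum_{G \in \mathcal{G}(a, b)} \lambda_G \exp\{\beta\, \phi(a) + \beta\, U_G(b)\}.$$

Since $U_G(b) - U_G(a) = \phi(b) - \phi(a)$ and $\mathcal{G}(a, b) = \mathcal{G}(b, a)$, we have

$$\mu(a) P_{ab} = \sum_{G \in \mathcal{G}(b, a)} \lambda_G \exp\{\beta\, \phi(b) + \beta\, U_G(a)\},$$

which leads us to

$$\mu(a) P_{ab} = \mu(b) P_{ba}.$$

□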


6.6.2  Restricted  Spatial  Adaptive  Play  with  Group  Based  Decisions 


Extending these results to accommodate restricted action sets is straightforward. Let $a(t-1)$ be the action profile at time $t-1$. In this case, the restricted action set for any group $G \subseteq \mathcal{P}$ at time $t$ will be $\mathcal{A}_G(t) = \prod_{\mathcal{P}_i \in G} R_i(a_i(t-1))$. We will state the following theorem without proof to avoid redundancy.

Theorem 6.6.2. Consider a finite n-player potential game with potential function $\phi(\cdot)$, a group probability distribution $P$ satisfying Assumption 6.6.1, and group utility functions satisfying Assumption 6.6.2. If the restricted action sets satisfy Assumptions 6.3.1 and 6.3.2, then RSAP induces a Markov process over the state space $\mathcal{A}$ where the unique stationary distribution $\mu \in \Delta(\mathcal{A})$ is given as

$$\mu(a) = \frac{\exp\{\beta\, \phi(a)\}}{\sum_{\bar{a} \in \mathcal{A}} \exp\{\beta\, \phi(\bar{a})\}}, \quad \text{for any } a \in \mathcal{A}.$$
6.6.3  Constrained Action Sets

The learning algorithms SAP or RSAP with group based decisions induced a Markov process over the entire set $\mathcal{A}$. We will now consider the situation in which each group's action set is constrained, i.e., $\mathcal{A}_G \subset \prod_{\mathcal{P}_i \in G} \mathcal{A}_i$. We will assume that the collective action set of each group is time invariant.

Under this framework, the learning algorithms SAP or RSAP with group based decisions induce a Markov process over the constrained set $\bar{\mathcal{A}} \subseteq \mathcal{A}$, which can be characterized as follows: Let $a(0)$ be the initial actions of all players. If $a \in \bar{\mathcal{A}}$ then there exists a sequence of action profiles $a(0) = a^0, a^1, \ldots, a^n = a$ with the condition that for all $k \in \{1, 2, \ldots, n\}$, $a^k = (a^k_{G_k}, a^{k-1}_{-G_k})$ for a group $G_k \subseteq \mathcal{P}$, where $P_{G_k} > 0$ and $a^k_{G_k} \in \mathcal{A}_{G_k}$. The unique stationary distribution $\mu \in \Delta(\bar{\mathcal{A}})$ is given as

$$\mu(a) = \frac{\exp\{\beta\, \phi(a)\}}{\sum_{\bar{a} \in \bar{\mathcal{A}}} \exp\{\beta\, \phi(\bar{a})\}}, \quad \text{for any } a \in \bar{\mathcal{A}}. \qquad (6.8)$$

6.7  Functional  Consensus 


In the consensus problem, as described in Section 6.3, the global objective was for all agents to reach consensus. In this section, we will analyze the functional consensus problem where the goal is for all players to reach a specific consensus point which is typically dependent on the initial action of all players, i.e.,

$$\lim_{t \to \infty} a_i(t) = f(a(0)), \quad \forall\, \mathcal{P}_i \in \mathcal{P},$$

where $a(0) \in \mathcal{A}$ is the initial action of all players and $f : \mathcal{A} \to \mathbb{R}$ is the desired function. An example of such a function for an n-player consensus problem is

$$f(a(0)) = \frac{1}{n} \sum_{\mathcal{P}_i \in \mathcal{P}} a_i(0),$$

for which the goal would be for all players to agree upon the average of the initial actions of all players. We will refer to this specific functional consensus problem as average consensus.


The consensus algorithm of (6.1) achieves the objective of average consensus under the condition that the interaction graph is connected and the associated weighting matrix, $\Omega = \{\omega_{ij}\}_{\mathcal{P}_i, \mathcal{P}_j \in \mathcal{P}}$, is doubly stochastic. A doubly stochastic matrix is any matrix where all coefficients are nonnegative and all column sums and row sums are equal to 1. The consensus algorithm takes on the following matrix form

$$a(t + 1) = \Omega\, a(t).$$

If $\Omega$ is a doubly stochastic matrix, then for any time $t > 0$,

$$\mathbf{1}^T a(t + 1) = \mathbf{1}^T \Omega\, a(t) = \mathbf{1}^T a(t).$$

Therefore, the sum of the actions of all players is invariant. Hence, if the players achieve consensus, they must agree upon the average.
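The invariance argument can be checked numerically with a short Python sketch. The weight matrix below is a hypothetical example chosen for illustration (a three-node path with self-loops), and the function names are not from the source.

```python
def consensus_step(weights, actions):
    """a(t+1) = Omega a(t) for a weight matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, actions)) for row in weights]

def is_doubly_stochastic(weights, tol=1e-12):
    """Nonnegative entries with every row sum and column sum equal to 1."""
    n = len(weights)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in weights)
    cols_ok = all(abs(sum(weights[i][j] for i in range(n)) - 1.0) < tol for j in range(n))
    nonneg = all(w >= 0 for row in weights for w in row)
    return rows_ok and cols_ok and nonneg

# Hypothetical weights for a path graph 0 - 1 - 2 with self-loops.
omega = [[0.5, 0.5, 0.0],
         [0.5, 0.25, 0.25],
         [0.0, 0.25, 0.75]]
a = [1.0, 4.0, 10.0]
for _ in range(200):
    a = consensus_step(omega, a)
```

After the loop, the entries of `a` are all close to the initial average 5, and the sum of the entries has stayed at 15 throughout, which is the invariance the derivation above relies on.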

In  order  to  achieve  any  form  of  functional  consensus  it  is  imperative  that  there  exist 
cooperation  amongst  the  players.  Players  must  agree  on  how  to  alter  their  action  each 
iteration. In the consensus algorithm, this cooperation is induced by the weighting matrix which specifies precisely how a player should change his action each iteration. If
a  player  acted  selfishly  and  unilaterally  altered  his  action,  the  invariance  of  the  desired 
function  would  not  be  preserved. 

6.7.1  Setup:  Functional  Consensus  Problem  with  Group  Based  Decisions 

Consider  the  consensus  problem  with  a  time  invariant  undirected  interaction  graph  as 
described  in  Section  6.3.  To  apply  the  learning  algorithm  SAP  or  RSAP  with  group 
based  decisions  to  the  functional  consensus  problem  one  needs  to  define  both  the  group 
utility  functions  and  the  group  selection  process. 
6.7.2  Group Utility Function


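Recall the potential function used for the consensus problem with a time invariant and undirected interaction graph analyzed in Section 6.3,

$$\phi(a) = -(1/2) \sum_{\mathcal{P}_i \in \mathcal{P}} \sum_{\mathcal{P}_j \in N_i} \|a_i - a_j\|.$$

We will assign any group $G \subseteq \mathcal{P}$ the following local group utility function

$$U_G(a) = -(1/2) \sum_{\mathcal{P}_i \in G} \sum_{\mathcal{P}_j \in N_i \cap G} \|a_i - a_j\| - \sum_{\mathcal{P}_i \in G} \sum_{\mathcal{P}_j \in N_i \setminus G} \|a_i - a_j\|. \qquad (6.9)$$

The factor $(1/2)$ avoids double counting, since the interaction graph is undirected. We will now show that this group utility function satisfies Assumption 6.6.2. Before showing this, let $N_G$ denote the neighbors of group $G$, i.e., $N_G = \bigcup_{\mathcal{P}_i \in G} N_i$. The change in the potential function by switching from $a = (a_G, a_{-G})$ to $a' = (a_G', a_{-G})$ is

$$\phi(a') - \phi(a) = -(1/2) \sum_{\mathcal{P}_i \in \mathcal{P}} \sum_{\mathcal{P}_j \in N_i} \left(\|a_i' - a_j'\| - \|a_i - a_j\|\right).$$

For simplicity of notation let $\delta_{ij} = -(1/2)\left(\|a_i' - a_j'\| - \|a_i - a_j\|\right)$. The change in the potential can be expressed as

$$\begin{aligned}
\phi(a') - \phi(a) &= \sum_{\mathcal{P}_i \in \mathcal{P}} \sum_{\mathcal{P}_j \in N_i} \delta_{ij} \\
&= \sum_{\mathcal{P}_i \in N_G} \sum_{\mathcal{P}_j \in N_i} \delta_{ij} \\
&= \sum_{\mathcal{P}_i \in G} \sum_{\mathcal{P}_j \in N_i \cap G} \delta_{ij} + \sum_{\mathcal{P}_i \in G} \sum_{\mathcal{P}_j \in N_i \setminus G} \delta_{ij} + \sum_{\mathcal{P}_i \in N_G \setminus G} \sum_{\mathcal{P}_j \in N_i} \delta_{ij} \\
&= \sum_{\mathcal{P}_i \in G} \sum_{\mathcal{P}_j \in N_i \cap G} \delta_{ij} + \sum_{\mathcal{P}_i \in G} \sum_{\mathcal{P}_j \in N_i \setminus G} \delta_{ij} + \sum_{\mathcal{P}_i \in N_G \setminus G} \sum_{\mathcal{P}_j \in N_i \cap G} \delta_{ij}.
\end{aligned}$$

Since the interaction graph is undirected, we know that

$$\sum_{\mathcal{P}_i \in G} \sum_{\mathcal{P}_j \in N_i \setminus G} \delta_{ij} = \sum_{\mathcal{P}_i \in N_G \setminus G} \sum_{\mathcal{P}_j \in N_i \cap G} \delta_{ij},$$

therefore, we can conclude that

$$\begin{aligned}
\phi(a') - \phi(a) &= \sum_{\mathcal{P}_i \in G} \Big( \sum_{\mathcal{P}_j \in N_i \cap G} \delta_{ij} + 2 \sum_{\mathcal{P}_j \in N_i \setminus G} \delta_{ij} \Big) \\
&= U_G(a') - U_G(a).
\end{aligned}$$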
6.7.3  Group  Selection  Process  and  Action  Constraints 

Let $a(t-1)$ be the action profile at time $t-1$. At time $t$, one player $\mathcal{P}_i$ is randomly (uniformly) chosen. Rather than updating his action unilaterally, player $\mathcal{P}_i$ first selects a group of players $G \subseteq \mathcal{P}$, which we will assume is the neighbors of player $\mathcal{P}_i$, i.e., $G = N_i$. The group is assigned a group utility function as in (6.9) and a constrained action set $\mathcal{A}_G \subset \prod_{\mathcal{P}_i \in G} \mathcal{A}_i$.

A central question is how one can constrain the group action set, using only location information, so as to preserve the invariance of the desired function $f$. In this case, we will restrict our attention only to functions where "local" preservation equates to "global" preservation. This means that for each group $G \subseteq \mathcal{P}$ there exists a function $f_G : \mathcal{A}_G \to \mathbb{R}$ such that for any group actions $a_G', a_G'' \in \mathcal{A}_G$

$$f_G(a_G') = f_G(a_G'') \ \Rightarrow\ f(a_G', a_{-G}) = f(a_G'', a_{-G}), \quad \forall\, a_{-G} \in \prod_{\mathcal{P}_i \notin G} \mathcal{A}_i.$$

Examples of functions that satisfy this constraint are

$$f_G(a) = \frac{1}{|G|} \sum_{\mathcal{P}_i \in G} a_i \ \Rightarrow\ f(a) = \frac{1}{|\mathcal{P}|} \sum_{\mathcal{P}_i \in \mathcal{P}} a_i,$$

$$f_G(a) = \max_{\mathcal{P}_i \in G} a_i \ \Rightarrow\ f(a) = \max_{\mathcal{P}_i \in \mathcal{P}} a_i,$$

$$f_G(a) = \min_{\mathcal{P}_i \in G} a_i \ \Rightarrow\ f(a) = \min_{\mathcal{P}_i \in \mathcal{P}} a_i.$$

In each of these examples, the structural form of $f$ and $f_G$ is equivalent. There may exist alternative functions where this is not required.
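For the average, the constrained group action set can be built by keeping only those collective actions that preserve the group's sum. The sketch below is one possible construction under stated assumptions: the function name is hypothetical, actions are integers (so the sums can be compared exactly), and the group is a tuple of player indices.

```python
import itertools

def sum_preserving_actions(group, profile, action_sets):
    """Collective actions for `group` that keep the group's sum, hence its average, fixed."""
    target = sum(profile[i] for i in group)
    return [
        collective
        for collective in itertools.product(*(action_sets[i] for i in group))
        if sum(collective) == target
    ]
```

Because players outside the group repeat their actions, any collective action returned here leaves the global sum, and therefore the global average, unchanged. This is the sense in which local preservation equates to global preservation for the averaging function.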
6.7.4  Illustration

We will illustrate this approach by solving the average consensus problem on the example developed in Section 6.3.4. Given the initial configuration, all players should agree upon the action (5, 5). We will solve this average consensus problem using the learning algorithm binary RSAP with group based decisions. However, we will omit the non-convex obstruction in this illustration. This omission is not necessary, but rather convenient for not having to verify the properties of the constrained action set, i.e., is consensus even possible, and Assumption 6.3.2 for the group action sets.

Figure 6.3 illustrates the evolution of each player's actions using the stochastic learning algorithm binary RSAP with group based decisions and an increasing $\beta$ coefficient, $\beta(t) = 1.5 + t(2/1000)$.

6.8  Illustrative  Examples 

In this section we will develop two examples to further illustrate the wide-ranging applicability of the theory developed in this chapter. The first problem we will consider is the dynamic sensor coverage problem. Second, we will demonstrate how this theory can be used to solve a popular mathematical puzzle called Sudoku.

6.8.1  Dynamic  Sensor  Coverage  Problem 

We consider the dynamic sensor coverage problem described in [LC05c] and references therein. The goal of the sensor coverage problem is to allocate a fixed number
of  sensors  across  a  given  “mission  space”  to  maximize  the  probability  of  detecting  a 
particular  event. 

[Figure 6.3: Evolution of Each Player's Action in the Average Consensus Problem (panels: Player 1, Player 2, Player 3, Player 4)]

We will divide the mission space into a finite set of sectors denoted as $\mathcal{S}$. There exists an event density function, or relative reward function, $R(s)$, defined over $\mathcal{S}$. We will assume that $R(s) \ge 0$, $\forall s \in \mathcal{S}$, and $\sum_{s \in \mathcal{S}} R(s) = 1$. In the application of enemy submarine tracking, $R(s)$ could be defined as the a priori probability that an enemy submarine is situated in sector $s$. The mission space and associated reward function that we will use in this section is illustrated in Figure 6.4.

There are a finite number of autonomous sensors denoted as $\mathcal{P} = \{\mathcal{P}_1, \ldots, \mathcal{P}_n\}$ allocated to the mission space. Each sensor $\mathcal{P}_i$ can position itself in any particular sector, i.e., the action set of sensor $\mathcal{P}_i$ is $\mathcal{A}_i = \mathcal{S}$. Furthermore, each sensor has limited sensing and moving capabilities. If an event occurs in sector $s$, the probability of sensor $\mathcal{P}_i$ detecting the event given his current location $a_i$ is denoted as $p_i(s, a_i)$. Typically, each sensor has a finite sensing radius, $r_i$, where the probability of detection obeys the following:

$$\|s - a_i\| < r_i \ \Leftrightarrow\ p_i(s, a_i) > 0.$$

An example of the sensing and moving capabilities of a particular sensor is illustrated in Figure 6.5.

[Figure 6.4: Illustration of Reward Function Over Mission Space]

For a given joint action profile $a = \{a_1, \ldots, a_n\}$, the joint probability of detecting an event in sector $s$ is given by

$$P(s, a) = 1 - \prod_{\mathcal{P}_i \in \mathcal{P}} [1 - p_i(s, a_i)].$$

In general a global planner would like the sensors to allocate themselves in such a fashion as to maximize the following potential function

$$\phi(a) = \sum_{s \in \mathcal{S}} R(s) P(s, a).$$
[Figure 6.5: Illustration of Sensor Coverage and Range Restricted Action Sets of a Particular Sensor]


One way to accomplish such an objective is to assign each sensor a utility function that is appropriately aligned with the global objective function as was the case in the consensus problem. One option is to just assign each sensor the global objective, i.e.,

$$U_i(a) = \phi(a).$$

Under this scenario, we have a potential game and one could use a learning algorithm like SAP or RSAP to guarantee that the sensors allocate themselves in a configuration that maximizes the global objective. However, this particular choice of utility functions requires each sensor to be knowledgeable of the locations and capabilities of all other sensors. To avoid this requirement, we will assign each sensor a Wonderful Life Utility [AMS07, WT99]. The utility of sensor $\mathcal{P}_i$ given any action profile $a \in \mathcal{A}$ is now

$$U_i(a) = \phi(a_i, a_{-i}) - \phi(a_i^0, a_{-i}), \qquad (6.10)$$

where the action $a_i^0$ is defined as the null action, which is equivalent to sensor $\mathcal{P}_i$ turning off all sensing capabilities. The term $\phi(a_i^0, a_{-i})$ captures the value of the allocation of all sensors other than sensor $\mathcal{P}_i$. Therefore, the utility of sensor $\mathcal{P}_i$ for an action profile $a$ is defined as his marginal contribution to the global objective. This means that a sensor now can evaluate his utility using only local information. Furthermore, the Wonderful Life Utility assignment preserves the potential game structure [AMS07, WT99], meaning that SAP or RSAP can now be implemented with the sensors using only local information to guarantee that the sensors allocate themselves in a desirable configuration.
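The coverage potential and the Wonderful Life Utility in (6.10) can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, sensing is assumed perfect inside the radius (as in the simulation described in this section), sectors and positions are coordinate tuples, `reward` is a dictionary from sector to $R(s)$, and the null action is represented by `None`.

```python
import math

def detection_probability(sector, positions, radii):
    """P(s, a) = 1 - prod_i [1 - p_i(s, a_i)] with p_i = 1 inside the radius."""
    miss = 1.0
    for pos, r in zip(positions, radii):
        if pos is not None and math.dist(sector, pos) < r:
            miss = 0.0  # one perfect detector makes the product vanish
    return 1.0 - miss

def potential(positions, radii, reward):
    """phi(a) = sum_s R(s) P(s, a)."""
    return sum(R * detection_probability(s, positions, radii) for s, R in reward.items())

def wonderful_life_utility(i, positions, radii, reward):
    """U_i(a) = phi(a_i, a_-i) - phi(a_i^0, a_-i), with None as the null action."""
    without_i = list(positions)
    without_i[i] = None
    return potential(positions, radii, reward) - potential(without_i, radii, reward)
```

Since the second term of the utility does not depend on sensor $i$'s own position, a unilateral move changes `wonderful_life_utility` by exactly the change in `potential`, which is the potential game property cited above. Evaluating it only requires the sectors within sensor $i$'s radius and the sensors covering them.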

In the following simulation we have the mission space and reward function as illustrated in Figure 6.4. The mission space is $\mathcal{S} = \{1, 2, \ldots, 100\} \times \{1, 2, \ldots, 100\}$ and the reward function satisfies $\sum_{s \in \mathcal{S}} R(s) = 1$. We have 18 different autonomous sensors, 6 with a sensing radius of 6, 6 with a sensing radius of 12, and 6 with a sensing radius of 18. For simplicity, each sensor will have perfect sensing capabilities within its sensing radius, namely for any sector $s$ satisfying $\|s - a_i\| < r_i$, $p_i(s, a_i) = 1$. Each sensor is endowed with the WLU as expressed in (6.10). All 18 sensors originally started at the location (1, 1) and each sensor has range restricted action sets as illustrated in Figure 6.5. We ran the binary RSAP with $\beta = 0.6$. Figure 6.6 illustrates a snapshot of the sensor configuration at the final iteration. Figure 6.7 illustrates the evolution of the potential function over the mission.

6.8.2  Sudoku 

Our  final  illustration  of  the  broad  applicability  of  potential  games  is  the  well  known 
mathematical puzzle of Sudoku. An example of a Sudoku puzzle is shown in Figure 6.8. The objective is to fill a 9x9 grid so that each column, each row, and each
of  the  nine  3x3  boxes  contains  the  digits  from  1  to  9.  The  puzzle  setter  provides  a 
partially  completed  grid  (blue  boxes)  which  cannot  be  changed. 

[Figure 6.7: Evolution of Potential Function over Mission]

[Figure 6.8: Illustration of a Sudoku Puzzle]

We will now illustrate that Sudoku is exactly a potential game when the players, action sets, and utility functions are designed appropriately. We will view each of the empty boxes as a self interested player $\mathcal{P}_i$ with action set $\mathcal{A}_i = \{1, 2, \ldots, 9\}$. Each player will be assigned the utility function

$$U_i(a) := \sum_{\mathcal{P}_j \in N_i^R} I\{a_i = a_j\} + \sum_{\mathcal{P}_j \in N_i^C} I\{a_i = a_j\} + \sum_{\mathcal{P}_j \in N_i^B} I\{a_i = a_j\},$$

where $N_i^R$, $N_i^C$, and $N_i^B$ are the row, column and box neighbors of player $\mathcal{P}_i$ and $I\{\cdot\}$ is the usual indicator function. An illustration of the neighbor sets of player $\mathcal{P}_i$ is highlighted in Figure 6.9, where the green boxes indicate the row neighbors, red boxes indicate the column neighbors, and yellow boxes indicate the box neighbors. Note that in this framework, unlike with the consensus problem, each player $\mathcal{P}_i$ is not a neighbor of himself.

To simplify the notation, we define the following function: for each player $\mathcal{P}_i$ and for any player set $\mathcal{P}' \subseteq \mathcal{P}$, let

$$n_i(a, \mathcal{P}') := \sum_{\mathcal{P}_j \in \mathcal{P}'} I\{a_i = a_j\}.$$

This function computes the number of players with the same action as player $\mathcal{P}_i$ in the set $\mathcal{P}'$. Using this function, we will express the utility of player $\mathcal{P}_i$ as

$$U_i(a) = n_i(a, N_i^R) + n_i(a, N_i^C) + n_i(a, N_i^B).$$

[Figure 6.9: Illustration of Neighbor Sets for a Player's Utility Function in a Sudoku Puzzle]

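We will now show that the Sudoku game with utilities defined as above is a potential game with potential function

$$\phi(a) = 1/2 \sum_{\mathcal{P}_i \in \mathcal{P}} U_i(a).$$

To simplify the analysis, we will break up the potential function as

$$\phi(a) = \phi^R(a) + \phi^C(a) + \phi^B(a),$$

where

$$\phi^R(a) = 1/2 \sum_{\mathcal{P}_i \in \mathcal{P}} n_i(a, N_i^R),$$

$$\phi^C(a) = 1/2 \sum_{\mathcal{P}_i \in \mathcal{P}} n_i(a, N_i^C),$$

$$\phi^B(a) = 1/2 \sum_{\mathcal{P}_i \in \mathcal{P}} n_i(a, N_i^B).$$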




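Let $a', a'' \in \mathcal{A}$ be any two action profiles that differ by a unilateral
deviation, i.e., $a'_i \neq a''_i$ and $a'_{-i} = a''_{-i}$ for some player
$\mathcal{P}_i \in \mathcal{P}$. The change in $\phi^R(\cdot)$ is

\[
\begin{aligned}
2\left(\phi^R(a') - \phi^R(a'')\right)
&= \sum_{\mathcal{P}_j \in \mathcal{P}} \left[ n_j(a', N_j^R) - n_j(a'', N_j^R) \right] \\
&= n_i(a', N_i^R) - n_i(a'', N_i^R)
   + \sum_{\mathcal{P}_j \in N_i^R} \left[ n_j(a', N_j^R) - n_j(a'', N_j^R) \right] \\
&= n_i(a', N_i^R) - n_i(a'', N_i^R)
   + \sum_{\mathcal{P}_j \in N_i^R} \left[ n_j(a', \{\mathcal{P}_i\}) - n_j(a'', \{\mathcal{P}_i\}) \right] \\
&= n_i(a', N_i^R) - n_i(a'', N_i^R)
   + \sum_{\mathcal{P}_j \in N_i^R} \left[ n_i(a', \{\mathcal{P}_j\}) - n_i(a'', \{\mathcal{P}_j\}) \right] \\
&= n_i(a', N_i^R) - n_i(a'', N_i^R) + n_i(a', N_i^R) - n_i(a'', N_i^R) \\
&= 2\left( n_i(a', N_i^R) - n_i(a'', N_i^R) \right).
\end{aligned}
\]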

One could repeat this analysis for $\phi^C(\cdot)$ and $\phi^B(\cdot)$ to show that

\[
\phi(a') - \phi(a'') = U_i(a') - U_i(a'').
\]

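Therefore the Sudoku game is in fact a potential game. Furthermore, the potential
function is always nonnegative, and achieves the value of 0 if and only if the Sudoku
puzzle has been solved. At a solution no player has any conflict, so no player can lower
its own conflict count $U_i$ by a unilateral deviation; with each player viewed as
minimizing $U_i$ (equivalently, maximizing $-U_i$), all solutions to the Sudoku puzzle
are therefore Nash equilibria of the Sudoku game. However, much like in the consensus
problem, there may exist suboptimal Nash equilibria.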
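The potential game identity can also be checked numerically. The sketch below draws
random joint actions and random unilateral deviations on a full 9x9 board and compares
the change in the potential with the change in the deviating player's utility. All names
are illustrative, every cell is treated as a player, and the box neighbors are assumed
to include cells that are also row or column neighbors; the identity only relies on the
neighbor relation being symmetric.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]


def neighbors(r: int, c: int) -> List[Tuple[int, int]]:
    """Row, column, and box neighbors of (r, c), concatenated (overlaps repeated)."""
    row = [(r, k) for k in range(9) if k != c]
    col = [(k, c) for k in range(9) if k != r]
    r0, c0 = 3 * (r // 3), 3 * (c // 3)
    box = [(i, j)
           for i in range(r0, r0 + 3)
           for j in range(c0, c0 + 3)
           if (i, j) != (r, c)]
    return row + col + box


def utility(a: Grid, r: int, c: int) -> int:
    """U_i(a): number of neighbors playing the same value as the player at (r, c)."""
    return sum(1 for (i, j) in neighbors(r, c) if a[i][j] == a[r][c])


def potential(a: Grid) -> float:
    """phi(a) = (1/2) * sum of U_i(a) over all players."""
    return 0.5 * sum(utility(a, r, c) for r in range(9) for c in range(9))


def check_potential_identity(trials: int = 100, seed: int = 0) -> bool:
    """Test phi(a') - phi(a'') == U_i(a') - U_i(a'') on random unilateral deviations."""
    rng = random.Random(seed)
    for _ in range(trials):
        a1 = [[rng.randint(1, 9) for _ in range(9)] for _ in range(9)]
        r, c = rng.randrange(9), rng.randrange(9)
        a2 = [row[:] for row in a1]
        # Only the player at (r, c) changes its action.
        a2[r][c] = rng.choice([v for v in range(1, 10) if v != a1[r][c]])
        if potential(a1) - potential(a2) != utility(a1, r, c) - utility(a2, r, c):
            return False
    return True
```

A random check of this kind is not a proof, but it is a quick way to catch an asymmetric
neighbor definition or a missing factor of 1/2.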

To solve the Sudoku puzzle we will use the learning algorithm SAP as described in
Section 6.3.2. We let the coefficient $\beta$ increase as $\beta(t) = t/5000$. Figure 6.10
shows the evolution of the potential function during the SAP learning process. One can
see that the potential function reaches the value of 0 after approximately 17,000
iterations, which means that the puzzle has been solved. To verify, the final joint
action profile is illustrated in Figure 6.11.
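A minimal sketch of this procedure follows. Section 6.3.2 is not reproduced here, so the
code assumes SAP in its usual log-linear form: at each step one randomly selected player
resamples its action with probability proportional to $\exp(\beta \cdot \text{utility})$.
The utility is taken to be the negative conflict count, given entries are modeled as
cells whose value never changes, and the function names, the `seed` argument, and the
iteration cap are hypothetical. Only the schedule $\beta(t) = t/5000$ comes from the text.

```python
import math
import random
from typing import List, Tuple

Grid = List[List[int]]   # 0 marks an empty box in the input puzzle


def neighbors(r: int, c: int) -> List[Tuple[int, int]]:
    """Row, column, and box neighbors of (r, c), concatenated (overlaps repeated)."""
    row = [(r, k) for k in range(9) if k != c]
    col = [(k, c) for k in range(9) if k != r]
    r0, c0 = 3 * (r // 3), 3 * (c // 3)
    box = [(i, j)
           for i in range(r0, r0 + 3)
           for j in range(c0, c0 + 3)
           if (i, j) != (r, c)]
    return row + col + box


def conflicts(grid: Grid, r: int, c: int, value: int) -> int:
    """Number of neighbors of (r, c) currently playing `value`."""
    return sum(1 for (i, j) in neighbors(r, c) if grid[i][j] == value)


def sap_sudoku(puzzle: Grid, max_iters: int = 200_000,
               beta_scale: float = 5000.0, seed: int = 0) -> Tuple[Grid, int, int]:
    """Return (grid, remaining potential, iterations used)."""
    rng = random.Random(seed)
    grid = [row[:] for row in puzzle]
    free = [(r, c) for r in range(9) for c in range(9) if puzzle[r][c] == 0]
    for r, c in free:
        grid[r][c] = rng.randint(1, 9)           # arbitrary initial joint action
    # phi = (1/2) * sum of conflict counts; the sum is even by symmetry.
    phi = sum(conflicts(grid, r, c, grid[r][c])
              for r in range(9) for c in range(9)) // 2
    values = list(range(1, 10))
    for t in range(1, max_iters + 1):
        if phi == 0 or not free:
            return grid, phi, t - 1
        beta = t / beta_scale                    # schedule from the text
        r, c = rng.choice(free)                  # one player updates per step
        cost = [conflicts(grid, r, c, v) for v in values]
        best = min(cost)
        # Shift by the minimum so the largest weight is exp(0) = 1 (no underflow).
        weights = [math.exp(-beta * (k - best)) for k in cost]
        new = rng.choices(values, weights=weights)[0]
        # By the potential identity, phi changes by the player's own change.
        phi += cost[new - 1] - cost[grid[r][c] - 1]
        grid[r][c] = new
    return grid, phi, max_iters
```

Tracking the potential incrementally avoids rescanning the board at every step. The
number of iterations needed depends on the puzzle and on the random seed, so the sketch
makes no attempt to reproduce the figure of roughly 17,000 iterations reported above.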

Figure 6.10: Evolution of Potential Function in Sudoku Puzzle Under the Learning Algorithm
Spatial Adaptive Play

Figure 6.11: The Completed Sudoku Puzzle

To further illustrate the applicability of SAP, we simulated SAP on a Sudoku puzzle
classified as very hard. Once again, a solution to the puzzle was found, as illustrated
in Figure 6.12.

Figure 6.12: Spatial Adaptive Play on a Sudoku Puzzle Classified as Very Hard

It is important to note that while it took many iterations to solve the Sudoku puzzles,
SAP was applied in its original form. We firmly believe that the algorithm could be
modified to decrease computation time. For example, a player's action set could be
reduced with knowledge of the board. In particular, the action set of player
$\mathcal{P}_1$ in Figure 6.9 could initially have been set as
$\mathcal{A}_1 = \{1, 2, 3, 6, 7, 8, 9\}$.
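One way such a reduction might be computed is sketched below: each empty box keeps only
the values that do not already appear among the given entries of its row, column, or
box. The function name and the example board are hypothetical; the board is not the
puzzle of Figure 6.9 and was chosen only so that one cell ends up with the same reduced
set as in the text.

```python
from typing import Dict, List, Set, Tuple

Grid = List[List[int]]   # 0 marks an empty box; 1..9 are given entries


def reduced_action_sets(puzzle: Grid) -> Dict[Tuple[int, int], Set[int]]:
    """Map each empty box to the values not given in its row, column, or box."""
    sets: Dict[Tuple[int, int], Set[int]] = {}
    for r in range(9):
        for c in range(9):
            if puzzle[r][c] != 0:
                continue                          # given entries are not players
            r0, c0 = 3 * (r // 3), 3 * (c // 3)
            seen = set(puzzle[r])                 # values in the same row
            seen |= {puzzle[k][c] for k in range(9)}          # same column
            seen |= {puzzle[i][j]
                     for i in range(r0, r0 + 3)
                     for j in range(c0, c0 + 3)}              # same box
            sets[(r, c)] = set(range(1, 10)) - seen
    return sets


# Hypothetical board: a 4 in the first row and a 5 in the first column.
board = [[0] * 9 for _ in range(9)]
board[0][5] = 4
board[6][0] = 5
print(sorted(reduced_action_sets(board)[(0, 0)]))   # [1, 2, 3, 6, 7, 8, 9]
```

Restricting the action sets in this way only removes values that conflict with fixed
entries, so every solution of the puzzle remains reachable.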

6.9  Concluding  Remarks 

We have proposed a game theoretic approach to cooperative control by highlighting a
connection between cooperative control problems and potential games. We introduced
a new class of games and enhanced existing learning algorithms to broaden the
applicability of game theoretic methods in the cooperative control setting. We
demonstrated that one could successfully implement game theoretic methods on the
cooperative control problem of consensus in a variety of settings. While the main
example used was the consensus problem, the results in Theorems 6.3.1, 6.4.1, and 6.6.1
and the notion of a sometimes weakly acyclic game are applicable to a broader class of
games as well as other cooperative control problems.
CHAPTER 7


Conclusions 

This  dissertation  focused  on  dealing  with  the  distributed  nature  of  decision  making  and 
information  processing  through  a  non-cooperative  game-theoretic  formulation.  The 
emphasis  was  on  simple  learning  algorithms  that  guarantee  convergence  to  a  Nash 
equilibrium. 

We analyzed the long-term behavior of a large number of players in large-scale
games where players are limited in both their observational and computational
capabilities. In particular, we analyzed a version of JSFP and showed that it
accommodates inherent player limitations in information gathering and processing.
Furthermore, we showed that JSFP has guaranteed convergence to a pure Nash equilibrium
in all generalized ordinal potential games, which include but are not limited to all
congestion games, when players use some inertia either with or without exponential
discounting of the historical data. We also introduced a modification of the traditional
no-regret algorithms that (i) exponentially discounts the memory and (ii) brings in a
notion of inertia in players' decision process. We showed how these modifications can
lead to an entire class of regret-based algorithms that provide convergence to a pure
Nash equilibrium in any weakly acyclic game.

The method of proof used for JSFP and the regret-based dynamics relies on inertia
to derive a positive probability of a single player seeking to make a utility
improvement, thereby increasing the potential function. This suggests a convergence rate
that is exponential in the game size, i.e., number of players and actions. It should be
noted that inertia is simply a proof device that assures convergence for generic
potential games. The proof provides just one out of multiple paths to convergence. The
simulations reflect that convergence can be much faster. Indeed, simulations suggest
that convergence is possible even in the absence of inertia. Furthermore, recent work
[HM06] suggests that convergence rates of a broad class of distributed learning
processes can be exponential in the game size as well, and so this seems to be a
limitation in the framework of distributed learning (as opposed to centralized
algorithms for computing an equilibrium) rather than of any specific learning process.

We also analyzed the long-term behavior of a large number of players in large-scale
games where players only have access to the action they played and the utility they
received. Our motivation for this information restriction is that in many engineered
systems, the functional forms of utility functions are not available, and so players must
adjust their strategies through an adaptive process using only payoff measurements. In
the dynamic processes defined here, there is no explicit cooperation or communication
between players. On the one hand, this lack of explicit coordination offers an element
of robustness to a variety of uncertainties in the strategy adjustment processes.
On the other hand, an interesting future direction would be to investigate to what
degree explicit coordination through limited communications could be beneficial.

In this payoff-based setting, players are no longer capable of analyzing the utility
they would have received for alternative action choices, as required in the
regret-based algorithms and JSFP. We introduced Safe Experimentation dynamics for
identical interest games, Simple Experimentation dynamics for weakly acyclic games with
noise-free utility measurements, and Sample Experimentation dynamics for weakly
acyclic games with noisy utility measurements. For all three settings, we have shown
that for sufficiently large times, the joint action taken by players will constitute a Nash
equilibrium. Furthermore, we have shown how to guarantee that a collective objective
in a congestion game is a (non-unique) Nash equilibrium.

Lastly, we proposed a game theoretic approach to cooperative control by highlighting
a connection between cooperative control problems and potential games. We
introduced a new class of games and enhanced existing learning algorithms to broaden
the applicability of game theoretic methods in the cooperative control setting. We
demonstrated that one could successfully implement game theoretic methods on several
cooperative control problems including consensus, dynamic sensor allocation, and
distributed routing over a network. Furthermore, we demonstrated how the
mathematical puzzle of Sudoku can be modeled as a potential game and solved in
a distributed fashion using the learning algorithms discussed in this dissertation.

In summary, this dissertation illustrated a connection between the fields of learning
in games and cooperative control and developed several suitable learning algorithms
for a wide variety of cooperative control problems. There remain several interesting
and challenging directions for future research.

Equilibrium  Selection  and  Utility  Design: 

One problem regarding a game theoretic formulation of a multi-agent system is the
existence of multiple Nash equilibria, not all of which are desirable operating
conditions. Is it possible to develop a methodology for designing agent
utilities/objectives and to derive implementable learning algorithms that guarantee the
agents' collective behavior converges to a desirable Nash equilibrium? For example, the
potential game formulation of the consensus problem had suboptimal Nash equilibria, i.e.,
Nash equilibria that did not represent consensus points. The existence of these
suboptimal Nash equilibria required the use of a stochastic learning algorithm such as
SAP or RSAP to guarantee reaching a desirable Nash equilibrium. However, when we modeled
the consensus problem as a sometimes weakly acyclic game and properly designed the
utilities, we were able to effectively eliminate these suboptimal Nash equilibria. Can
this be accomplished for more general cooperative control problems?

Learning  Algorithms  for  Stochastic  Games: 

In many cooperative control problems players are inherently faced with a notion of
state dependent action sets and objectives. Stochastic games, which generalize Markov
decision processes to multiple decision makers, emerge as the most natural framework
to study such cooperative systems. An important research direction is to understand the
applicability of Markov games to cooperative control problems and to develop simple
computational learning algorithms for stochastic games with guaranteed convergence
results. We believe that the notion of a sometimes weakly acyclic game is an initial
step in the direction of Markov games.

Learning  Algorithms  with  Time  Guarantees: 

One open issue regarding the applicability of the learning algorithms discussed
in this dissertation is time complexity. Roughly speaking, how long will it take the
agents to reach some form of a desirable operating condition? One relevant question
is whether non-stochastic learning algorithms, such as JSFP and regret-based
algorithms, have a computational advantage over stochastic learning algorithms, such as
SAP or RSAP. If the answer to this question is affirmative, then the notion of utility
design plays an even more important role in the applicability of these learning
algorithms for controlling multi-agent systems.